ABSTRACT


Reduced precision redundancy (RPR), a new method for improving fault tolerance in
FPGAs, appears promising as a replacement for triple modular redundancy (TMR) in
preventing single event effects due to radiation in arithmetic processes. As a test of this
approach, the RPR technique was used to implement a Radix-4 fast Fourier transform
(FFT). This design was implemented in a Xilinx Virtex II FPGA in order to find the
possible gain in speed and power as compared to the TMR method.


This thesis deals with a 64-point Radix-4 in-place FFT, based on an improved
FFT algorithm. The whole FFT structure was implemented using self-designed
modules and the embedded modules of the Virtex II FPGA. The goal was
to create a fast and small FFT module that could be altered according to specific
application requirements. The implementation of the FFT was successful, managing to
handle data in real time at a speed of 134 MHz.


Based on this FFT design, the next challenge was the implementation of TMR and
RPR modules. The first attempt was the TMR structure, implemented by creating three
identical replicas of the FFT and installing a voter per FFT stage. This implementation
was unsuccessful due to space limitations. The next step was the alteration of the
existing FFT and the creation of a smaller 8 x 8 bit butterfly module for the RPR
structure. After the successful completion of this step, implementation of a RPR module
with a degree of 8/32 was commenced. Ambiguities and inefficient radiation protection
were identified in this implementation. Finally, by adopting a new RPR approach and a
higher degree of 14/32, a smooth and correct RPR module was created that could work in
real time and handle data at a speed of 163 MHz. The TMR method and the RPR method
with a degree of 14/32 were compared, confirming the advantage of RPR in power
consumption and in occupied FPGA resources.

EXECUTIVE SUMMARY


Field Programmable Gate Arrays (FPGA) are integrated circuits containing
programmable logic components and interconnects that can be used to create
complex logic functions. They are characterized by the unique ability to be
updated or changed depending on current requirements. One of the main applications for
FPGAs is in the space industry, where arrays must be updated or reconfigured without
being physically accessed. Difficulties arise in the use of such devices in the harsh space
environment, where guard methods against single event effects (SEE) from radiation are
required.


A simple solution to this problem is the use of the Triple Modular Redundancy
(TMR) method. This is a process that protects the whole FPGA configuration against
SEE, at the cost of demanding a significant amount of resources. Snodgrass
recognized the problem and introduced, in his PhD dissertation in 2006 [1], a new
method of fault tolerance. This method, referred to as the Reduced Precision
Redundancy (RPR) method, is an alternative way of implementing and protecting the
required design. It can be used only for arithmetic processes and requires fewer
resources than TMR. The use of RPR requires a compromise between capacity demand
and output precision, depending on the “degree” of RPR (the measure of the reduction
of precision).


The objective of this thesis was the creation of a fast Fourier transform (FFT)
structure that could be implemented in a FPGA of the Virtex II family, adopting both
methods of redundancy, TMR and RPR, in order to investigate the performance of RPR.
First, a simple 64-point Radix-4 in-place FFT was implemented that could handle
fixed-point two’s complement numbers. This structure was tested with accurate results.
Next, a TMR structure was designed by creating three identical FFT structures and by
inserting a voter at the end of each stage. The design was successful, but the
implementation failed to fit within the size constraints of the Virtex II FPGA,
revealing the significant demand for resources of the TMR method. The next step was
the design of a RPR structure with a degree of 8/32 (8 bits of reduced precision and a
32-bit precise result). The design was successfully implemented, but the protection against
radiation failed due to ambiguities and errors that were not considered at that time.
Taking into consideration the problems from the previous unsuccessful design, a new
RPR structure with a degree of 14/32 was designed and implemented. This
implementation worked correctly and protected the FFT structure efficiently.


Based on the research conducted in this thesis, an alternative RPR method is
suggested, in which there is no actual need for generating upper and lower bounds. Instead,
the truncation and duplication of the precise number, in combination with theoretical
boundary calculations, is sufficient. This alteration simplifies the logic and
decreases the size of the voter.


Finally, both TMR and RPR methods were implemented successfully in a slightly
larger Virtex II, demonstrating the advantage of RPR over TMR in resource requirements
and power consumption.

I. INTRODUCTION


Modern satellites are capable of handling controls, communications, observation
systems, and on-board payload data processing tasks. The inaccessibility of a satellite
after launch and the need for periodic updates of the satellite’s data handling procedures
create a strong argument for using field programmable gate array (FPGA) instead of
application-specific integrated circuit (ASIC) technologies. The fact that a FPGA can be
reconfigured whenever necessary is a major advantage and the main reason that FPGAs
are preferred in spacecraft circuit design.


Spacecraft computer systems must be able to operate reliably, despite the harsh 
radiation environment. High energy protons from the Van Allen radiation belt, cosmic 
rays from outer space, and highly charged ions from solar flares are just a few of the 
threats that need to be taken into consideration [2] in spacecraft circuit design. In order
to prevent radiation effects, hardening of the device is a significant priority. However,
continuously decreasing transistor sizes and the simultaneous increase in operating
frequencies are obstacles to hardening efforts.


Radiation can cause unwanted effects, such as the flipping of a memory cell’s 
state in semiconductor devices, better known as soft errors or single event upsets (SEU). 
FPGAs are susceptible to errors in both data and architecture configuration caused by 
SEU. In order to prevent this faulty behavior, triple modular redundancy (TMR) is 
commonly used to ensure reliable operation. However, TMR is very costly in terms of 
chip area and power consumption. Current research is focused on introducing new 
methods of fault tolerance for space-borne reprogrammable computers. At the Naval 
Postgraduate School (NPS) in 2006, a new method of fault tolerance was introduced [1], 
referred to as reduced precision redundancy (RPR). RPR applies redundancy only to the
most significant numerical bits of a circuit and in this way significantly decreases the
needed chip area and power consumption compared with that required for TMR.


In this thesis, a Radix-4 64-point FFT is implemented in a Virtex II XC2V6000
FPGA chip using a Xilinx interface in order to further examine the effectiveness of the
RPR method. This chip was chosen because it is used in the space-flight
prototype processor, CFTP-2 [3].


A. OBJECTIVE 


Reduced precision redundancy (RPR), a new method of fault tolerance in digital 
arithmetic processors, appears promising as a technique for replacing triple modular 
redundancy (TMR) against single event effects due to radiation. In order to demonstrate 
this promise, the RPR technique is used to implement a Radix-4 fast Fourier transform 
(FFT). This design is implemented with different degrees of RPR in a Xilinx Virtex II
FPGA in order to find possible improvements in chip area, speed, and power
consumption compared with the TMR method.


B. FFT DESIGN OVERVIEW 


The basic design goal is to implement a FFT design based on the RPR concept in
a Virtex II XC2V6000. This effort focuses on the creation of one 64-point (N = 64)
Radix-4 FFT that can handle one 32-bit input signal in every clock period (in real time)
and at the maximum possible frequency, combined with two RPR modules of the same
philosophy with a degree of 8/32 = 0.25.


C. BACKGROUND 
1. Space Environment and FPGA


The space environment is harsh and has a serious impact on any spacecraft that
orbits, even for a short period of time. Thermal imbalances, erosion and surface damage
are just a few of the threats. In this thesis, efforts to prevent or decrease the impact of a
different threat, space radiation, are examined.


The space radiation environment is encountered by the majority of space missions
and is the result of galactic cosmic rays (GCRs), of particles emitted by solar events and
of particles trapped in the Earth’s radiation belts [2]. GCRs are highly energetic
protons and heavy ions, reaching energies in excess of 10 GeV/nucleon. The GCR
environment in interplanetary space changes with the phase of the solar cycle. Particles
emitted by solar events are an important contribution to radiation and are correlated
with the eleven-year solar cycle, because large solar events are more frequent during
solar maxima than during solar minima. Particles trapped in the Earth’s radiation belts,
sometimes called the Van Allen belts, are most significant between altitudes of
approximately 1,000 km and 32,000 km. These particles consist of electrons, protons and
heavy ions that are trapped in the Earth’s magnetic field. Spacecraft shielding is capable
of protecting against only some of these particles [2].


This radiation and the limitations of shielding can cause unwanted effects on
space-borne circuits, better known as single event effects (SEE). SEE can take many
forms, and some of them are more destructive than others. In this thesis, the
non-destructive SEE known as a single event upset, or soft error, is considered. A soft
error is the transient corruption of a single bit of data. Unfortunately, FPGAs are
susceptible to soft errors in both data and the architecture configuration, both of which
are stored in memory.


2. Fault Tolerance Methods: Redundancy


One way to maximize a system’s fault tolerance is to use triple modular
redundancy (TMR). The basic design of TMR consists of three identical copies of the
operation that is required to be “secure.” The three identical system results are processed
by a voting system to produce a single output. If any one of the three systems fails, the
other two systems can correct and mask the fault. The disadvantage of using TMR is the
high chip area occupation and the increased power consumption [4]. Snodgrass [1]
suggested the concept of a new fault tolerance method, the reduced precision redundancy
(RPR) method, which allows a sacrifice in the level of precision in an arithmetic
calculation in return for decreased power and capacity demand. Instead of implementing
three identical copies of a circuit and voting on the result, one fully functional copy of the
circuit and two reduced precision copies, which define the upper and lower limits of the
function’s output, are created. The voter then compares the three values and checks to
determine if the function’s result is within the limits of the reduced precision values or if
an error has occurred. In either case a result can be generated. Depending on where the
error occurred, the result will be a precise or less-precise output calculation.
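The voting idea described above can be illustrated with a minimal Python sketch for the addition of two unsigned 32-bit operands with a degree of 8/32. The names `rpr_bounds` and `rpr_vote`, the restriction to unsigned operands and the decision rule are assumptions made purely for illustration; they are not the voter logic of [1] or [4].

```python
WIDTH = 32                      # bits of the precise result (assumed)
KEPT = 8                        # bits kept by the reduced-precision copies (degree 8/32)
SHIFT = WIDTH - KEPT
SPAN = 2 * ((1 << SHIFT) - 1)   # largest possible sum of the two discarded low parts


def rpr_bounds(a, b):
    """Reduced-precision copies: lower and upper limits of a + b."""
    lower = ((a >> SHIFT) + (b >> SHIFT)) << SHIFT   # add the truncated operands
    return lower, lower + SPAN


def rpr_vote(full, lower, upper):
    """Return (value, status) from one precise and two reduced-precision results."""
    if lower <= full <= upper:
        return full, "precise"      # precise copy lies between both limits
    if upper - lower == SPAN:
        return lower, "reduced"     # limits agree with each other: precise copy is suspect
    return full, "unchecked"        # a limit is suspect: fall back to the precise copy
```

In this sketch, an upset in a high-order bit of the precise sum moves it outside the limits, so the voter returns the lower limit as a less-precise answer. An upset confined to the low-order bits may stay inside the limits and pass through, with an error no larger than the distance between the limits.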


D. ORGANIZATION OF THIS THESIS 


Chapter II, Previous Work, describes the theses and other previous work on
which this thesis is based.

Chapter III, Theoretical Approach, describes the theoretical background and the
algorithm choices that arise in the design of a FFT.

Chapter IV, Design Implementation Details and Description of the
Implementation Effort, provides a description of the modules and data management
strategies used to develop the FFT design and explains the implementation efforts for
TMR and RPR modules.

Chapter V, Results, contains an analysis of the occupied resources and power
consumption for both TMR and RPR modules.

Chapter VI, Conclusion and Recommendations, contains a summary of the total
effort and recommendations for future work.


II. PREVIOUS WORK


A. PREVIOUS THESIS 


This thesis is based on the dissertation of Joshua D. Snodgrass [1] and on two
theses, from Nikolaos Gkikas [5] and Margaret A. Sullivan [4], respectively. These
works reveal valuable information and results that are combined in the design efforts
examined in this thesis.


1. Gkikas Thesis 


Gkikas describes the design of a Radix-4 FFT for wireless communications
(wireless local area networks (LANs)) [5]. He compares different algorithm structures
in terms of complexity, memory needs, data flow and controller complexity,
searching for the most suitable algorithm for his intended application.


First, he analyzes the differences between the discrete Fourier transform (DFT)
and the FFT, Radix-2, Radix-4 and Split-Radix algorithms. Based on theoretical
approaches, he reports that the FFT provides a faster solution than the DFT. Comparing
Radix-2 to Radix-4, he concludes that Radix-4 requires fewer multiplications, so it is the
preferred solution. Investigating the Split-Radix algorithm, he calculates that it requires
even fewer multiplications than the Radix-4, but it has the disadvantage of greater
complexity. Continuing his investigation, he provides some useful information about the
decimation-in-time-frequency (DITF) algorithm, the fast Hartley transform and the quick
Fourier transform. His final conclusion for his application is the adoption of the Radix-4
FFT.


Secondly, he analyzes the structure of a Radix-4, 64-point, in-place decimation
in-frequency (DIF) FFT, identifying its major modules. He explains the basic principles of
each module and writes a brief note about the interconnections and the possible
controlling algorithms that he uses in his C++ implementation. His design contains four
major components. The first component is a compute-factor component that generates
the needed factors. The second component comprises two 64-sample ROMs that
store the computed phase factors and the twiddle factors. The third component is a
multiply accumulator that consists of one butterfly (BF) machine, one adder and two
registers, to compute the real or imaginary part of the stage’s output. Finally, the fourth
component is a controller that synchronizes all of the components.


In the third part of his thesis, Gkikas examines the structure of the butterfly
machine and, based on the theory of the improved Radix-4 algorithm derived from the
Cooley and Tukey algorithm [6], suggests the use of the factor terms. This structure is
helpful because a minor modification to the factor equations, while leaving the rest of
the design intact, provides an inverse FFT (IFFT) algorithm module. Moreover, the use
of factors improves data reusability and decreases the number of required computations.
He also studies the phase factors and their correct use in this structure.


Finally, he examines three different structures of FFT, based on the number of
butterfly machines that each of them includes. He first considers a 48-butterfly FFT,
where each of the three stages has 16 butterflies and each butterfly is used only once
in every 64-point block of data. After that he examines the use of a 16-butterfly FFT,
where all three stages use the same 16 butterflies, so that each butterfly is used three
times, once per stage. Then, he explores the use of a 4-butterfly FFT, where each
butterfly is used 12 times, four times per stage, for three stages.


2. Sullivan Thesis


Based on Snodgrass’ dissertation [1], Sullivan illustrates the application of RPR
as a new method of fault tolerance in FPGAs against single event effects [4]. She
examines the use of different degrees of RPR depending on the arithmetic operation and
compares the impact of those degrees on area consumption.
Specifically, she categorizes the problems that are suitable for RPR implementation into
two major categories, depending on the required computation:
addition/subtraction and multiplication/division. Later she investigates the possibilities of
implementing RPR in a FFT algorithm.


For each category, she describes the mathematical relationships and, depending
on the two operands, defines each case. She determines the lower and upper bounds
and, at the same time, points to special cases that demand unique handling in order to
avoid possible errors or overflows. In a similar manner, she designs a RPR voter for each
task, acknowledging the differences between the addition voter and the multiplication
voter, and provides useful tips about the necessary signals, function behavior and
checking comparators of the module.


In the final portion of her research on the arithmetic operations, she comes to the
conclusion that the use of RPR instead of TMR in the addition/subtraction operation does
not guarantee less area occupancy in every case. After checking the results of the FPGA
area comparison, she confirms that “... in order for RPR to be a more desirable
fault-tolerance approach than TMR for a simple operation like addition or subtraction, the
degree of RPR must be significantly less than 0.5 - and that for both the adder and the
voter in a RPR addition process to be smaller than the analogous TMR modules the
degree of RPR must be less than 0.25 [4].”


Moreover, she concludes that for the multiplication operation “... a RPR
multiplication module requires 1/3 to 1/2 the FPGA slices of a TMR multiplication
module depending on the degree of RPR [4].” However, she notes that the size of the
multiplication RPR voter is extremely large when compared to the TMR voter module, a
drawback that should not be underestimated. Another crucial question that she tries to
answer is whether to test intermediate results or just the final result for
error. She points out, “Any benefit of testing intermediate results in RPR processes must
be considered against the additional space it requires [4].”


Later, she explains the basics of a FFT algorithm and presents thoughts about
different ways of implementing a RPR voter in a FFT. According to those thoughts,
“...we may include one or more voters on the final or intermediate results [4].”
Therefore, there are many choices; for example, a voter can be implemented after each
multiplication, after each complex product or at the end of each stage. She
predicts that using an 8/32 degree of RPR in a FFT butterfly operation instead of TMR
will save 66 percent of occupied FPGA slices. Finally, she concludes that using a RPR
voter at a few major points within the system will keep the area cost of the FPGA low
compared to the cost of TMR.


B. PREVIOUS PAPERS 


Various papers, application reports and notes were also considered in this thesis.
The following notes are not summaries of these reports. Only the ideas and thoughts
that were considered useful for the design are discussed.


1. Application Reports and Notes 


In [7], Wu describes the implementation of a Radix-4 DIF FFT using the Texas
Instruments TMS320C80 digital signal processor (DSP). Although the report is dated, it
reveals basic FFT implementation principles in a parallel processor and points out
possible errors and/or hazards due to data overflow. In [8], Delphin of ST Industries
describes the implementation of the Radix-4 FFT algorithm using the ST120 DSP.
There she explains the basic structure of a Radix-4 FFT and a new algorithm that is
derived from the Cooley and Tukey algorithm [6]. She discusses the correct order and
the characteristics of each stage’s BF factors and twiddle numbers. In order to save
memory resources, she chooses the “in-place” FFT structure, but she notes the need for
a digit-reversing procedure on each stage’s output. Finally, she provides a C++ example
program that reveals the structure of specific parts of her design.


2. Papers


There are three relevant papers that discuss the design and implementation
improvements of a Radix-4 FFT algorithm in a FPGA. Each of them provides important
advice concerning the structure of the design considered in this thesis. In the first paper
[9], Sun, Liu and Ji give a detailed description of the FFT’s memory structure, RAM
storage capabilities and restrictions of RAM used in a FPGA. They consider a new way
of handling data addresses to allow manipulation of four input and output data streams,
while at the same time bypassing the limitations of modern two-port RAM hardware. In
the second paper [10], Bouguezel discusses the benefits of using the improved Radix-4
FFT algorithm in comparison to the Cooley-Tukey FFT algorithm [6]. In the third paper
[11], Chao, Qin, Yingke and Chengde work on the design of a high performance FFT
processor based on a FPGA. They introduce two important ideas for the optimization of
the FPGA, ideas that could be implemented in future iterations of the design presented in
this thesis. The first idea is the lifting scheme, which is a transform that reduces the
number of real multiplications in a BF from four to three. While of benefit, note that it
also increases the number of real additions from two to three. The second idea is the use
of an adaptive overflow calculation. This novel approach ensures that no overflow takes
place over the entire calculation. This is performed without limiting the data path width
or decreasing the efficiency of the BF.


With important information from previous theses, application notes and papers
considered, the theoretical background required for the creation of the FFT design is
discussed in Chapter III.


III. THEORETICAL APPROACH


A. PROBLEM DISCUSSION 


In order to successfully implement a FFT in a FPGA, many aspects must be
considered and wise decisions must be made concerning different algorithms, geometries
and decimations. Each approach has advantages and disadvantages, but the most suitable
one must be selected. The goal is the implementation of a 64-point FFT that can handle
data in real time, while occupying as few resources as possible. Before discussion of
design optimization, where the three basic categories of decimations, geometries and
algorithms must be considered, it is useful to describe the discrete Fourier transform.


1. Discrete Fourier Transform 


The discrete Fourier transform (DFT) plays an important role in many
applications of digital signal processing. The reason for its importance is the existence of
efficient algorithms for computing the DFT.


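The DFT sequence $\{X(k)\}$ of $N$ complex-valued numbers, given a sequence of
data $\{x(n)\}$ of length $N$, is computed as:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N-1$$

where:

$$W_N = e^{-j2\pi/N}$$

The corresponding inverse DFT is:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-kn}, \qquad 0 \le n \le N-1.$$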


For each value of $k$, $N$ complex multiplications ($4N$ real multiplications) and $N-1$
complex additions ($4N-2$ real additions) are required. This indicates that in order to
compute all $N$ values of the DFT, $N^2$ complex multiplications and $N^2 - N$ complex
additions are required.
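As a point of reference, the DFT and its inverse can be evaluated directly from their definitions. The following Python sketch is only an illustration of the direct computation and its operation count; the function names `dft`, `idft` and `direct_dft_cost` are hypothetical and do not appear in the thesis design.

```python
import cmath


def dft(x):
    """Direct N-point DFT: X(k) = sum over n of x(n) * W_N^(k*n)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)          # phase factor W_N
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]


def idft(X):
    """Direct inverse DFT: x(n) = (1/N) * sum over k of X(k) * W_N^(-k*n)."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * n) for k in range(N)) / N for n in range(N)]


def direct_dft_cost(N):
    """Complex multiplications and complex additions of the direct DFT."""
    return N * N, N * (N - 1)
```

For the 64-point case considered in this thesis, `direct_dft_cost(64)` gives 4096 complex multiplications and 4032 complex additions, which is the cost that the FFT algorithms discussed next are meant to reduce.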
Unfortunately, direct computation of the DFT is inefficient due to the fact that
“...it does not exploit the symmetry and periodicity properties of the phase factor $W_N$
[12].” These properties are summarized as:

Symmetry property: $W_N^{k+N/2} = -W_N^{k}$

Periodicity property: $W_N^{k+N} = W_N^{k}$


The solution to this problem is the decomposition of the N-point DFT into
successively smaller DFTs. This approach, named divide-and-conquer, makes the
computation of the DFT more efficient, leading to fast Fourier transform (FFT)
algorithms.


2. Fast Fourier Transform 


The fast Fourier transform (FFT) is an efficient algorithm that uses a reduced
number of arithmetic operations as compared to DFT, “... eliminating redundancies that
result from adding certain data sequence values after they have been multiplied by the
same factors of fixed complex constants during the evaluation of different DFT transform
coefficients. The efficiency is achieved at the expense of reordering the data sequence,
but the additional expense is generally small compared to the reduction in multiplications
and additions.” [13]


In order to use an efficient FFT algorithm based on the divide-and-conquer
approach, the number of data points, $N$, must be highly composite, meaning $N$ can be
described as $N = r_1 r_2 r_3 \cdots r_v$, where the $\{r_i\}$ are prime.

In the case where $r_1 = r_2 = \cdots = r_v \equiv r$, then $N = r^v$ and the number $r$ is called
the radix of the FFT algorithm. Each FFT radix-r algorithm can be categorized into two
groups, depending on the decimation chosen. The two groups are the decimation in-time
(DIT) algorithms, where time samples are computed in alternating groups, and the
decimation in-frequency (DIF) algorithms, where frequency samples are computed,
separately, in alternating groups.

a. Radix-2 Versus Radix-4 Versus Split-Radix


When comparing the three major FFT algorithms, Radix-2 DIF, Radix-4
DIF and Split-Radix DIF, the following conclusions can be made [5], [14]:

The Radix-2 DIF algorithm (Figure 1) significantly decreases the number
of complex multiplies compared to the DFT, from $N^2$ to $(N/2)\log_2 N$, and the number
of complex additions from $N(N-1)$ to $N\log_2 N$. It is a very popular algorithm due to the
symmetry and periodicity properties of its twiddle factors.


Figure 1. Radix-2 DIF Butterfly (From [15])


The Radix-4 DIF algorithm (Figure 2) yields an even greater decrease in
the number of complex multiplies as compared to the Radix-2 DIF. The reduction is
from $(N/2)\log_2 N$ to $(3N/8)\log_2 N$. Therefore, the Radix-4 DIF requires 75 percent
as many multiplies as the Radix-2 DIF and the same number of additions. It also
demonstrates the same symmetry and periodicity properties of its twiddle factors.

Figure 2. Radix-4 DIF Butterfly (From [16])


The Split-Radix DIF (Figure 3) is computationally superior to both the
Radix-2 and Radix-4 algorithms when considering the number of required multiplications
and additions. However, this algorithm has the disadvantage of structural irregularity, a
disadvantage that prohibits its use in an implementation design effort [12]. So, after
comparing these three popular algorithms, the Radix-4 DIF algorithm is selected for the
design considered in this thesis.
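The operation counts quoted in this comparison can be evaluated for the 64-point case with a short Python sketch. It only applies the formulas given in the text; the function name `operation_counts` is hypothetical, and N is assumed to be a power of four so that both radix formulas apply.

```python
from math import log2


def operation_counts(N):
    """(complex multiplies, complex additions) from the formulas in the text.

    N is assumed to be a power of 4 so that the Radix-4 formula applies.
    """
    stages = round(log2(N))                        # log2(N)
    return {
        "DFT": (N * N, N * (N - 1)),
        "Radix-2 DIF": ((N // 2) * stages, N * stages),
        "Radix-4 DIF": ((3 * N // 8) * stages, N * stages),
    }
```

For N = 64 this gives 4096 complex multiplies for the direct DFT, 192 for the Radix-2 DIF and 144 for the Radix-4 DIF, which is the 75 percent ratio stated above.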


Figure 3. Split Radix DIF (From [16])

b. In-Place Versus Constant Geometry Structure


The term “in-place” (Figure 4) refers to the fact that each time a butterfly
is computed, the correct set of data is read from memory and the results of the
butterfly computation are written back into the same set of places in memory [17]. The
advantage of the “in-place” structure is that the set of data needed each time is the
same set of data produced by the butterfly. This means that data can be
overwritten in order to minimize memory requirements. This is very useful in
implementing a FFT in an old FPGA, where the goal is to use the least amount of
memory, rather than to achieve real time signal processing.


The term “constant geometry” (Figure 5) indicates that the connections
between memory slots are the same in each stage. The advantage of this structure is
that it is less complicated than the “in-place” structure, reducing hardware size.
“Constant geometry” seems more attractive for a highly parallel hardware
implementation [18], but the potential of the “in-place” structure in an old FPGA results
in its preference for the thesis design.


Figure 4. Radix-2 DIF FFT with in-place input and output (From [15])

Figure 5. Radix-2 DIF FFT with constant geometry structure (From [15])


B. CONCEPTUAL DESIGN MODEL 


In order to evaluate RPR concepts in a FFT, many aspects are considered in
choosing an implementation. The final choice for this thesis is a 64-point Radix-4
in-place DIF FFT implementation in the Virtex II XC2V6000 FPGA. The following section
provides a brief overview of the theoretical approach to this thesis design.


1. 64-Point Radix-4 in-Place DIF 


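The Radix-4 DIF FFT divides the 64-point DFT into four 16-point DFTs, then
into 16 four-point DFTs [7-8]. The butterfly of a Radix-4 consists of four inputs and four
outputs, as shown below:

$$X(k) = \sum_{n=0}^{N/4-1} x(n)\,W_N^{kn} + \sum_{n=N/4}^{N/2-1} x(n)\,W_N^{kn} + \sum_{n=N/2}^{3N/4-1} x(n)\,W_N^{kn} + \sum_{n=3N/4}^{N-1} x(n)\,W_N^{kn}$$

$$= \sum_{n=0}^{N/4-1} \left[ x(n) + x\!\left(n+\tfrac{N}{4}\right) W_N^{Nk/4} + x\!\left(n+\tfrac{N}{2}\right) W_N^{Nk/2} + x\!\left(n+\tfrac{3N}{4}\right) W_N^{3Nk/4} \right] W_N^{kn}$$

where:

$$W_N^{Nk/4} = (-j)^k, \qquad W_N^{Nk/2} = (-1)^k, \qquad W_N^{3Nk/4} = (j)^k$$

Combining terms yields the following equation:

$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^k x\!\left(n+\tfrac{N}{4}\right) + (-1)^k x\!\left(n+\tfrac{N}{2}\right) + (j)^k x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{kn}$$

To get four-point DFT decomposition, we decompose again as:

$$\begin{aligned}
X(4k) &= \sum_{n=0}^{N/4-1} \left[ x(n) + x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) + x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{0}\, W_{N/4}^{kn} \\
X(4k+1) &= \sum_{n=0}^{N/4-1} \left[ x(n) - j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) + j\,x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{n}\, W_{N/4}^{kn} \\
X(4k+2) &= \sum_{n=0}^{N/4-1} \left[ x(n) - x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) - x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{2n}\, W_{N/4}^{kn} \\
X(4k+3) &= \sum_{n=0}^{N/4-1} \left[ x(n) + j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) - j\,x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{3n}\, W_{N/4}^{kn}
\end{aligned}$$

This is called the butterfly and is repeated for all four-point bundles as shown in
Figures 6 and 7.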
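The decomposition can be checked numerically with a small software model. The following Python sketch applies the radix-4 DIF butterfly in place, stage by stage, and then reorders the result with a base-4 digit reversal. It is a floating-point illustration under assumed names (`fft_radix4_dif_inplace`, `digit_reverse_base4`, `fft_radix4`), not the fixed-point hardware design of this thesis.

```python
import cmath


def fft_radix4_dif_inplace(x):
    """In-place radix-4 DIF FFT on a list whose length is a power of 4.

    The result is left in base-4 digit-reversed order.
    """
    N = len(x)
    span = N                                   # size of the sub-DFT at this stage
    while span >= 4:
        q = span // 4
        for start in range(0, N, span):
            for n in range(q):
                i = start + n
                a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
                w = cmath.exp(-2j * cmath.pi * n / span)   # twiddle W_span^n
                x[i] = a + b + c + d                       # leg 1, twiddle W^0
                x[i + q] = (a - 1j * b - c + 1j * d) * w   # leg 2, twiddle W^n
                x[i + 2 * q] = (a - b + c - d) * w ** 2    # leg 3, twiddle W^2n
                x[i + 3 * q] = (a + 1j * b - c - 1j * d) * w ** 3  # leg 4, W^3n
        span = q
    return x


def digit_reverse_base4(i, num_digits):
    """Reverse the order of the base-4 digits of i."""
    r = 0
    for _ in range(num_digits):
        r = (r << 2) | (i & 3)
        i >>= 2
    return r


def fft_radix4(x):
    """Radix-4 DIF FFT with the output restored to natural order."""
    buf = fft_radix4_dif_inplace(list(x))
    digits, m = 0, len(buf)
    while m > 1:
        m //= 4
        digits += 1
    out = [0j] * len(buf)
    for p, value in enumerate(buf):
        out[digit_reverse_base4(p, digits)] = value
    return out
```

Because every butterfly writes its four results back to the four locations it read, no second buffer is needed during the stages; the only extra step is the final digit reversal.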


x(4r) -> leg 1 of the butterfly
x(4r+1) -> leg 2 of the butterfly
x(4r+2) -> leg 3 of the butterfly
x(4r+3) -> leg 4 of the butterfly

Figure 6. 16-Point Radix-4 DIF FFT Butterfly (From [8])

Figure 7. 64-Point Radix-4 DIF FFT

2. An Improved Radix-4 DIF FFT Algorithm

The improved Radix-4 DIF FFT algorithm is derived from the following 
equations. 

The basic equations of the FFT’s butterfly are: 

A=[a+b+c+dW, 

B=|a- jb-—c+ jd|W, 

C =[a—b+c-—dW,." 

D=[a+ jb—c— jd]W," 


where: 
a=a,+ Ja,, 
b=b_+ jb, 
C=C 4. Jc. 
d=d_+ jd,, 

and Wy a Wy Wa i 


Transforming the upper equations yields: 
A=[(a, + ja,,)+(b, + jb,,)+(c, + jc,,)+(d, + jd,, Wy 
B — [(a, F IGin) 7. J, ae Jb) a (c. Fr ICin) ay j(d, a Jd;,,) Wy 
C = [(a, Sy Gin) —(b, ay Jb.) a (c, ob ICin) =(G, ate jd,,,) Wy" 
D=[(A, + jain) + J, + JD) —(C, + Jin) — Hd, + iin Wy" 
For the computation of the real and imaginary part of the first output of the 
butterfly, only the butterfly’s inputs are added, without multiplying any phase factors. 
$$A_r = a_r + b_r + c_r + d_r$$

$$A_{im} = a_{im} + b_{im} + c_{im} + d_{im} \qquad \text{(III.1)}$$


The real and imaginary parts of the B, C, D outputs produce six common factors:

$$\mathrm{factor}_1 = a_r + b_{im} - c_r - d_{im}$$

$$\mathrm{factor}_2 = a_r - b_r + c_r - d_r$$

$$\mathrm{factor}_3 = a_r - b_{im} - c_r + d_{im}$$

$$\mathrm{factor}_4 = a_{im} - b_r - c_{im} + d_r$$

$$\mathrm{factor}_5 = a_{im} - b_{im} + c_{im} - d_{im}$$

$$\mathrm{factor}_6 = a_{im} + b_r - c_{im} - d_r \qquad \text{(III.2)}$$
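As a quick consistency check of Equations (III.1) and (III.2), the Python sketch below computes the first output and the six common factors from real and imaginary parts only, then confirms that they pair up into the bracketed (pre-twiddle) terms of B, C and D. The function name `butterfly_factors` is an assumption made for illustration; the twiddle multiplication itself is deliberately left out.

```python
def butterfly_factors(a, b, c, d):
    """Return the first output A and the six common factors of the butterfly.

    a, b, c, d are the complex butterfly inputs. Only additions and
    subtractions of real and imaginary parts are used, as in (III.1)-(III.2).
    """
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    cr, ci = c.real, c.imag
    dr, di = d.real, d.imag
    A = complex(ar + br + cr + dr, ai + bi + ci + di)   # Equation (III.1)
    f1 = ar + bi - cr - di   # real part of a - jb - c + jd
    f2 = ar - br + cr - dr   # real part of a - b + c - d
    f3 = ar - bi - cr + di   # real part of a + jb - c - jd
    f4 = ai - br - ci + dr   # imaginary part of a - jb - c + jd
    f5 = ai - bi + ci - di   # imaginary part of a - b + c - d
    f6 = ai + br - ci - dr   # imaginary part of a + jb - c - jd
    return A, (f1, f2, f3, f4, f5, f6)
```

The pairing (factor 1 with factor 4, factor 2 with factor 5, factor 3 with factor 6) follows directly from expanding the brackets, which is why each of the B, C and D outputs needs only two factors before the twiddle multiplication.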

Expressing the B, C, D outputs of the butterfly in factor terms: 

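$$B_r = \mathrm{factor}_1 \cdot W_{N\_r}^{n} + \mathrm{factor}_4 \cdot W_{N\_im}^{n}$$

$$B_{im} = \mathrm{factor}_4 \cdot W_{N\_r}^{n} - \mathrm{factor}_1 \cdot W_{N\_im}^{n}$$

$$C_r = \mathrm{factor}_2 \cdot W_{N\_r}^{2n} + \mathrm{factor}_5 \cdot W_{N\_im}^{2n}$$

$$C_{im} = \mathrm{factor}_5 \cdot W_{N\_r}^{2n} - \mathrm{factor}_2 \cdot W_{N\_im}^{2n}$$

$$D_r = \mathrm{factor}_3 \cdot W_{N\_r}^{3n} + \mathrm{factor}_6 \cdot W_{N\_im}^{3n}$$

$$D_{im} = \mathrm{factor}_6 \cdot W_{N\_r}^{3n} - \mathrm{factor}_3 \cdot W_{N\_im}^{3n} \qquad \text{(III.3)}$$

where:

$$W_{N\_r}^{kn} = \cos(2\pi kn/N)$$

$$W_{N\_im}^{kn} = -\sin(2\pi kn/N)$$

based on the reference of Table 1.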


Table 1. Characteristics of 64-point Radix-4 FFT

[Table: twiddle factor exponents for Leg1, Leg2, Leg3, and Leg4]


Based on the theoretical background discussed above, it is clear that the best design implementation for this application is a 64-point Radix-4 in-place FFT, which can handle 32-bit fixed-point signals in real time.
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IV. DESIGN IMPLEMENTATION DETAILS OF FFT 


This chapter focuses on the design implementation of the FFT. The objective is to create an FFT built only from modules developed in this research and designed in a Xilinx environment. This implementation is for a Virtex II FPGA XC2V-6000-6bf957.


A. DESCRIPTION 


The FFT design contains three identical stage modules, one main controller, and a last-stage reversing module, as shown in Figure 8. It receives as inputs a 32-bit real number and a 32-bit imaginary number, both of which are fixed-point, normalized, two's complement numbers. It produces as outputs a 32-bit real number and a 32-bit imaginary number. The detailed FFT design is included in Appendix A. According to the analysis conducted in this thesis and an application report from Cheng [18], a 3-bit overflow is possible in each stage, meaning a 9-bit overflow for the entire FFT. To avoid overflow issues, a simple solution was chosen: input values were restricted to values less than 2”.

Figure 8. 64-Point Radix-4 In-Place DIF Signed Fixed-Point 32-Bit FFT: The Upper Level
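One way to see where a figure of 3 bits per stage can come from is a worst-case gain estimate. The Python sketch below is an illustrative bound, not the analysis used in the thesis or in Cheng's report: it assumes that the real and imaginary parts of every butterfly input are bounded by the same magnitude, that each factor sums four such parts, and that the twiddle multiplication combines two factors weighted by a cosine and a sine. The function name `stage_growth_bits` is hypothetical.

```python
import math

def stage_growth_bits(n_points=64):
    """Estimate the worst-case magnitude growth of one radix-4 stage.

    Returns (gain, bits): the worst-case gain on a real or imaginary part
    and the number of extra bits needed to hold it.
    """
    sum_gain = 4.0  # four input parts are added or subtracted per factor
    # factor_x * cos + factor_y * sin is at most |cos| + |sin| times the factor bound
    rotation_gain = max(
        abs(math.cos(2 * math.pi * k / n_points)) + abs(math.sin(2 * math.pi * k / n_points))
        for k in range(n_points)
    )
    gain = sum_gain * rotation_gain
    return gain, math.ceil(math.log2(gain))

gain, bits = stage_growth_bits()
total_bits = 3 * bits  # three stages in a 64-point radix-4 FFT
```

Under these assumptions the gain is 4 times the square root of 2, which needs 3 extra bits per stage and 9 bits over the three stages.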


1. Stage 1-2-3 


All three stages are constructed identically, are pipelined, and have a latency of 85 clock cycles per stage. These three stages are depicted in Figure 9. They include a stage controller, a RAM, a compute factor, a BF multiplier, and a multiplexer module. The stage controller generates addresses; the RAM receives those addresses and the input signals and sends the output signals to the compute factor. The compute factor computes the factors and sends them to the BF multiplier or to the multiplexer. The BF multiplier performs the necessary multiplications and sends the data to the multiplexer. The multiplexer chooses between incoming data from the compute factor and from the BF multiplier.

Figure 9. 64-Point Radix-4 In-Place DIF Signed Fixed-Point 32-Bit FFT: Stages 1, 2, and 3

a. Stage Controller


The stage controller receives three signals and exports four. The stage controller of each stage waits for the write enable (WE) signal of the main controller. From then on, it continuously creates input and output addresses for the RAM, based on the algorithm depicted in Figure 7. It also generates the addresses for the ROM's twiddle factors and the switch command for the multiplexer.


b. RAM 


The RAM module contains a ROM with the needed twiddle factors and two 128-point, 32-bit RAMs generated with the Xilinx CORE Generator tool. The RAM is the only part of the design that is made with the CORE Generator. The RAM receives three addresses: one for the incoming input signals, one for the required output signal, and one for the required twiddle factors. Each of the real and imaginary input signals is stored in one half of the RAM and is exported only when the complete 64-point signals have been collected. The RAM module exports four signals: the real and imaginary signal and the real and imaginary twiddle factor. Note that the ROM's stored twiddle factors are not rounded.


c. Compute Factor


The compute factor has four inputs and five outputs. It needs four pairs of signals in order to compute the proper factors and starts the procedure after being commanded by the main controller. It computes three pairs of factors that are sent to the butterfly, as demonstrated in Equation (III.2), and a pair of signals that are ready for the next stage, as demonstrated in Equation (III.1). This pair of signals bypasses the BF multiplier and goes directly to the multiplexer.


d. BF Multiplier 


The BF multiplier is the most complex module of the design. It receives 7 inputs, among them a pair of input factors from the compute factor and a pair of twiddle factors from RAM. It outputs the real and imaginary part of the computed signal.

The aim is to compute the parts of Equation III.3. This is demonstrated by the following computation:

$$B_r = \mathrm{factor}_1 \cdot W_{N\_r} + \mathrm{factor}_4 \cdot W_{N\_im}$$

$$B_{im} = \mathrm{factor}_4 \cdot W_{N\_r} - \mathrm{factor}_1 \cdot W_{N\_im} \qquad \text{(IV.1)}$$


Thus, four multiplies and two additions are required. For each multiply, the Virtex II embedded pipelined multiplier, MULT18X18S, is used. This choice was made to ensure the maximum possible speed for the pipelined FFT, on the assumption that the embedded multipliers are probably faster than any behavioral multiplier that could be designed for this thesis work. The disadvantage of this choice is the limited bit width of each multiplier. The Virtex II embedded multiplier can handle 18-bit signed two's complement numbers and outputs a 36-bit result. However, the input signals in this design are 32 bits long. This requires the use of four embedded multipliers for each multiplication. To illustrate, computation of the first product, $\mathrm{factor}_1 \cdot W_{N\_r}$, is shown in Figure 10:

Figure 10. BF Multiplier: Use of Embedded Multipliers, Part 1


Figure 10 clarifies the situation. After the import of the two 32-bit signals, a factor and a twiddle factor, four different outputs are received from the embedded multipliers. Next, these outputs are subject to the concatenation or addition of bits in order to secure the correct multiplication of the signed fixed-point numbers. Finally, in the third stage, three 64-bit signals are received.
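The idea of building one 32 x 32 signed product from four 18 x 18 signed multipliers can be sketched in Python as follows. This is an assumption-laden model, not the thesis netlist: it splits each operand into a signed upper 16 bits and an unsigned lower 16 bits (both fit an 18-bit signed port), and it assumes the three 64-bit terms are formed by concatenating the high and low partial products and shifting the two cross products. The helper names `split16`, `mult18` and `partial_terms` are hypothetical.

```python
def split16(v):
    """Split a signed 32-bit integer into (signed high half, unsigned low half)."""
    return v >> 16, v & 0xFFFF

def mult18(a, b):
    """Model of one 18x18 signed multiplier: both operands must fit in 18 bits."""
    assert -(1 << 17) <= a < (1 << 17)
    assert -(1 << 17) <= b < (1 << 17)
    return a * b

def partial_terms(x, y):
    """Return three terms whose sum is the full product x * y."""
    xh, xl = split16(x)
    yh, yl = split16(y)
    hh = mult18(xh, yh)
    hl = mult18(xh, yl)
    lh = mult18(xl, yh)
    ll = mult18(xl, yl)
    # ll is non-negative and below 2**32, so adding it to hh << 32 behaves
    # like a bit concatenation of the two partial products.
    t0 = (hh << 32) + ll
    t1 = hl << 16
    t2 = lh << 16
    return t0, t1, t2

product = sum(partial_terms(123456789, -987654321))
```

Because three terms remain after this step, a 3:2 carry-save compression followed by one two-operand addition is a natural way to finish the product.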


At the fourth stage, those three signals are sent to the carry save adder module, where the number of signals is reduced from three to two. This is depicted in Figure 11. At the end of the fourth stage, two 64-bit signals are available for import to the carry look ahead adder module. This module has a structure similar to that depicted in Figure 12. The carry look ahead adder adds the two 64-bit numbers and outputs the product, which is then concatenated. The concatenation is based on the fact that although two 32-bit fixed-point signals (with a 30-bit fractional part) are imported, 36-bit fixed-point numbers (with a 30-bit fractional part) are being multiplied.
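The three-to-two reduction followed by a single two-operand addition can be modelled in a few lines of Python. The sketch below is a behavioural model under the assumption of 64-bit two's complement words; it does not reproduce the structure of the carry look ahead adder, only its arithmetic result. The names `carry_save_add`, `to_signed64` and `add3` are illustrative.

```python
MASK64 = (1 << 64) - 1

def carry_save_add(a, b, c):
    """3:2 compression: return (sum_word, carry_word) with a + b + c == sum + carry (mod 2**64)."""
    sum_word = a ^ b ^ c                             # per-bit sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted left
    return sum_word & MASK64, carry_word & MASK64

def to_signed64(v):
    """Interpret a 64-bit word as a two's complement integer."""
    return v - (1 << 64) if v & (1 << 63) else v

def add3(a, b, c):
    """Add three signed 64-bit values using carry-save compression and one final addition."""
    s, cy = carry_save_add(a & MASK64, b & MASK64, c & MASK64)
    return to_signed64((s + cy) & MASK64)   # the final addition stands in for the CLA
```

The carry-save step has no carry propagation, so its delay does not depend on the word length; only the final addition has to propagate carries.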
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Figure 11. BF Multiplier—CSA module—Part2 
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64-Bit Carry Lookahead Adder 
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Figure 12. BF Multiplier-—CLAH Module—Part3 (From [19]) 


So far, only one product of Equation IV.1 has been computed. Four identical copies of the structure shown in Figure 13 are used to compute the four products of Equation IV.1. Note that during the concatenation procedure, no rounding occurs.

e. Multiplexer


The multiplexer receives a pair of input signals from the BF multiplier and 
a pair of signals directly from the compute factor module. The stage controller 
manipulates the multiplexer and determines which of the two pairs are going to be sent to 


the next stage. 


2. Main Controller 


The main controller is the module that manipulates the entire design by managing the controllers of each stage. It contains a reset command that restarts the entire system and two pairs of signals that wake the stage controllers and the compute factors of each stage at the proper time.


3. Reversing Last Stage 


The reversing last stage module generates the output address of the output signals. In order to avoid using another RAM for storing the signals in their scrambled order and outputting them in natural order, the preference is to export them in the scrambled in-place order and indicate the correct order by using an address monitor.
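For a 64-point radix-4 FFT, the scrambled output order is a reversal of the three base-4 digits of the 6-bit index. A minimal Python sketch of that address mapping is given below; the function name `reverse_base4_digits` is an assumption, and the sketch models only the index arithmetic, not the clocked module.

```python
def reverse_base4_digits(index):
    """Reverse the three base-4 digits of a 6-bit index (0..63)."""
    d0 = index & 0b11          # least significant base-4 digit
    d1 = (index >> 2) & 0b11   # middle digit
    d2 = (index >> 4) & 0b11   # most significant digit
    return (d0 << 4) | (d1 << 2) | d2

# Output position k of the in-place FFT holds the sample whose natural index is:
natural_order = [reverse_base4_digits(k) for k in range(64)]
```

For example, index 300 in base 4 (decimal 48) maps to 003 (decimal 3), and 012 (decimal 6) maps to 210 (decimal 36). Applying the mapping twice returns the original index.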


B. IMPLEMENTATION EFFORTS AND RESULTS 


In order to investigate the use of the RPR module it must first be compared to a 
TMR module that works under the same initial conditions. Therefore, a TMR 64-point 
Radix-4 in place DIF was designed and then implemented in the Virtex-2 XC2v6000 
FPGA. Note that the voter in both cases (TMR and RPR) must be in the same position. 


In any other configuration a comparison would be ineffective. 


1. Implementing a TMR 


The design of a TMR is a simple process when compared to the design of an RPR. It is necessary to choose the frequency of the check points in the FFT and evaluate the values of the three identical modules through a small voter. Then the results are verified, with the expectation that at least two of the three values will be identical. In this manner, an FPGA is protected from unwanted space radiation.

As the frequency of the check points increases, the protection effectiveness increases. However, the increased use of voters raises the capacity demand of the design. A good compromise is to use one voter per FFT stage. The practical implication is that only three voters are required for a 64-point Radix-4 FFT.
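A two-out-of-three voter can be sketched in Python as a bitwise majority function. This is one common way to build such a voter and is an assumption here; the thesis does not spell out the internal logic of its voter. The function name `tmr_vote` is illustrative.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three words: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)

good = 0xDEADBEEF
upset = good ^ (1 << 7)              # one replica with a single flipped bit
voted = tmr_vote(good, upset, good)  # the two healthy replicas outvote the upset one
```

A single corrupted replica is always outvoted; if two replicas are corrupted in the same bit position, the voter passes the error on.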


At the conclusion of the implementation effort, a 64-point Radix-4 in-place DIF FFT was successfully designed and implemented in a Virtex-2 XC2v6000 FPGA. The entire design is included in Appendix B. The design used much less than 12 percent of the slice resources of the FPGA, with one exception: the embedded 18 x 18 pipelined multipliers. Exactly 3 x 4 x 4 = 48 embedded multipliers were required out of the 144 embedded multipliers available in the XC2v6000 FPGA. A TMR version requires three times the resources of the primary design. Therefore, less than 36 percent of the FPGA's slices and 3 x 48 = 144 embedded multipliers should be required, which is the entire number of embedded multipliers available.


At this point all indications were positive and the TMR was designed. After successfully synthesizing and simulating the design, an attempt was made to implement it in the FPGA. However, it was discovered that the TMR FFT could not be implemented in the FPGA. The reason was that the six block RAMs of the module demanded six empty adjacent embedded multipliers, because the embedded RAMs and multipliers share the same routing resources in the Virtex II FPGA. Thus, the design required 144 + 6 = 150 multipliers, or 104 percent of the available resources, and the implementation effort failed. Nevertheless, in order to compare the TMR and RPR resource and power requirements, an implementation of both modules was attempted on the larger Virtex II XC2V8000 FPGA and is discussed in Chapter V.


2. Implementing a RPR—First Attempt—RPR Degree 8/32 


Based on the previous concepts, an RPR module was created using the primary design as its core, with RPR voters embedded at the end of each stage. This design consisted of one precise module unit and two smaller average precision module units. The precise module unit was identical to the primary design module and handled 32-bit real and imaginary inputs, while outputting 32-bit real and imaginary results. The average precision module units were divided into an upper bound unit and a lower bound unit, which defined the upper and lower limits of the precise module.


a. RPR Upper and Lower Modules 


The upper and lower average precision modules were almost identical. They truncated the 32-bit real and imaginary input into 8-bit numbers. In the upper module, the truncated number was increased by one at the least significant bit (LSB). The 8-bit number passed through the modified compute factor module and entered the modified BF multiplier. Inside the BF multiplier, because an 8 x 8 bit multiplier was needed instead of the 32 x 32 bit multiplier of the precise module, a single 18 x 18 embedded multiplier was used instead of four. There was no need for CSA or CLAH, and the result of the multiplication was available sooner than in the precise module unit. In order to keep the design synchronized, a delay was inserted after the 8 x 8 BF multiplier to equalize the latency, as depicted in Figure 14.


[Figure 14: block diagram of one RPR stage. Precise path: Compute Factor, 32x32 bit BF Multiplier (4 Mult18x18S, CSA, CLAH). Two reduced-precision paths, each: Compute Factor, 8x8 bit BF Multiplier (1 Mult18x18S), Delayer. All three paths feed the RPR Voter.]

Figure 14. Reduced Precision Redundancy Stage Portrayal


b. Overflow Approach 


In the primary design of the FFT, the overflow problem was recognized and the simplest solution was implemented: the input was restricted to normalized signals with values less than 2’. This secured the design from possible overflow events. In the first attempt at implementing RPR, a different approach was chosen. Based on Cheng's work, “Autoscaling Radix-4 FFT for TMS320C6000” [20], a 3-bit shift per stage was adopted in order to avoid overflow issues. Specifically, a 3-bit shift was applied at the input of the RAM of each stage.


c. Ambiguity Phenomenon


In the design of the RPR modules, the upper bound module is distinguished from the lower bound module by adding one to the LSB of the truncated 8-bit number. This leads to the following relations:

$$A_{Lower} \le A_{Precise} \le A_{Upper}$$

$$B_{Lower} \le B_{Precise} \le B_{Upper}$$

$$C_{Lower} \le C_{Precise} \le C_{Upper}$$

$$D_{Lower} \le D_{Precise} \le D_{Upper}$$

Therefore, using the following equations:

$$\mathrm{factor}_{Example} = A + B - C - D$$

$$\mathrm{Real} = \mathrm{factor}_{Example} \cdot W_{N\_real} - \mathrm{factor}_{Example} \cdot W_{N\_image}$$

yields the following results for the upper, lower, and precise modules:

$$\mathrm{factor}_{Example_{Lower}} = A_{Lower} + B_{Lower} - C_{Lower} - D_{Lower}$$

$$\mathrm{factor}_{Example_{Precise}} = A_{Precise} + B_{Precise} - C_{Precise} - D_{Precise}$$

$$\mathrm{factor}_{Example_{Upper}} = A_{Upper} + B_{Upper} - C_{Upper} - D_{Upper}$$

But these outcomes lead to the following ambiguity, because the subtracted terms reverse the direction of their bounds:

$$\mathrm{factor}_{Example_{Lower}} \overset{?}{\le} \mathrm{factor}_{Example_{Precise}} \overset{?}{\le} \mathrm{factor}_{Example_{Upper}}$$

A solution to this problem is the creation of a unified upper and lower bound module where each equation loads upper or lower values depending on the sign:

$$\mathrm{factor}_{Example_{Lower}} = A_{Lower} + B_{Lower} - C_{Upper} - D_{Upper}$$

$$\mathrm{factor}_{Example_{Precise}} = A_{Precise} + B_{Precise} - C_{Precise} - D_{Precise}$$

$$\mathrm{factor}_{Example_{Upper}} = A_{Upper} + B_{Upper} - C_{Lower} - D_{Lower}$$

And so:

$$\mathrm{factor}_{Example_{Lower}} \le \mathrm{factor}_{Example_{Precise}} \le \mathrm{factor}_{Example_{Upper}}$$
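The ambiguity, and the effect of choosing upper or lower values according to sign, can be reproduced with a small integer model. The Python sketch below is an illustration under stated assumptions: 32-bit integer inputs, truncation to the top 8 bits by an arithmetic right shift of 24, and an upper bound equal to the truncated value plus one. The names `bounds`, `naive_factor_bounds` and `unified_factor_bounds` are hypothetical.

```python
SHIFT = 24  # keep the top 8 bits of a 32-bit word

def bounds(v):
    """Return (lower, upper) in 8-bit units such that lower <= v / 2**SHIFT < upper."""
    lower = v >> SHIFT   # arithmetic shift, rounds toward minus infinity
    return lower, lower + 1

def naive_factor_bounds(a, b, c, d):
    """Lower module uses only lower values, upper module only upper values."""
    (al, au), (bl, bu), (cl, cu), (dl, du) = map(bounds, (a, b, c, d))
    return al + bl - cl - dl, au + bu - cu - du

def unified_factor_bounds(a, b, c, d):
    """Subtracted terms take the opposite bound, so the interval stays valid."""
    (al, au), (bl, bu), (cl, cu), (dl, du) = map(bounds, (a, b, c, d))
    return al + bl - cu - du, au + bu - cl - dl

a = b = 0x00FFFFFF
c = d = 0
precise = (a + b - c - d) / 2 ** SHIFT     # just under 2.0 in 8-bit units
naive = naive_factor_bounds(a, b, c, d)    # both bounds collapse to 0
unified = unified_factor_bounds(a, b, c, d)
```

In this example the naive bounds are both 0 although the precise factor is almost 2, while the unified bounds of -2 and 2 enclose it.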


d. RPR Survey Problems 


Incorporating a 3-bit shift in each stage of the FFT introduces a total shift of 9 bits at the final output. With an 8-bit RPR shield (8/32 degree), it follows that in a non-overflow case the RPR is incapable of protecting the precise module, as illustrated in the following equations. Figure 15 makes clear that in a non-overflow case, the RPR cannot sufficiently protect the third stage of the FFT.


3rd _ Stage Precise Output(not_ overflow) <2” 
3rd Stage RPR_ Shield >2~° 
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Figure 15. RPR Shield and 3-Bit Shifting Obstacle 
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There are two possible solutions to this problem. Either a program must be 
created that recognizes and manipulates the possible overflow cases by shifting the input 
data by the appropriate amount of bits only when it is desired, or the degree of the RPR 
must be increased. In the first situation, the concept of an intelligent overflow 
manipulator is most desirable in order to preserve the 8/32 degree of the RPR. However, 
this manipulator will increase the complexity of the design due to the need for marking 
and shielding of the shifted bit amount throughout the entire FFT structure. In the second 
case, an increase of the RPR degree is not desirable, but seems rather attractive at this 


point because it minimizes alteration of the primary module. 


3. Implementing a RPR—Second Attempt—RPR Degree 14/32 


Taking into consideration the ambiguities and errors from the previous implementation, an RPR module was created, using the primary design as its core, with RPR voters embedded at the end of each stage. This design is included in Appendix C and consists of one precise module unit and two smaller low precision module units. The precise module unit is essentially identical to the previous design module (IV.B.2) and handles 32-bit real and imaginary inputs, while outputting 32-bit real and imaginary results. In this case, the only difference is that the bound module units are identical and are not divided into upper and lower bound units.


a. RPR Bound Modules 


The RPR bound module truncates the 32-bit real and imaginary input into a 14-bit number. The 14-bit number passes through the modified compute factor module and enters the altered BF multiplier. Inside the BF multiplier, because a 14 x 14 bit multiplier is needed instead of the 32 x 32 bit multiplier of the precise module, only one 18 x 18 embedded multiplier was used instead of four. There was no need for CSA or CLAH, and the result of the multiplication was available sooner than in the precise module unit. In order to keep the design synchronized, a delayer was used after the 14 x 14 BF multiplier. The major difference between the first and the second RPR implementation was the altered approach of the RPR function. Instead of using an upper and a lower bound, two identical bounds were used that handle the truncated 14-bit value of the 32-bit input. At the end of each stage, the voter inspects the duplicate truncated values and performs a simple bit-by-bit comparison. If the values are identical, the voter computes the expected upper and lower limits based on the truncated 14-bit output and on the theoretical expected errors, as shown in Equation IV.3.


$$\mathrm{factor}_{Precise} = A_{Precise} + B_{Precise} - C_{Precise} - D_{Precise}$$

$$\mathrm{factor}_{Trunc} = A_{Trunc} + B_{Trunc} - C_{Trunc} - D_{Trunc} = \Psi$$

$$\mathrm{factor}_{Trunc\_Up} = (A_{Trunc} + \text{1'b1}) + (B_{Trunc} + \text{1'b1}) - C_{Trunc} - D_{Trunc} = \Psi + \text{2'b10}$$

$$\mathrm{factor}_{Trunc\_Lo} = A_{Trunc} + B_{Trunc} - (C_{Trunc} + \text{1'b1}) - (D_{Trunc} + \text{1'b1}) = \Psi - \text{2'b10}$$

In the worst possible case,

$$\Psi - \text{2'b10} < \mathrm{factor}_{Precise} < \Psi + \text{2'b10} \qquad \text{(IV.2)}$$

Using the same method,

$$\mathrm{Real}_{Precise} = \mathrm{factor}_{Precise} \cdot W_{N\_Precise\_real} - \mathrm{factor}_{Precise} \cdot W_{N\_Precise\_image}$$

$$\mathrm{Real}_{Trunc} = \mathrm{factor}_{Trunc} \cdot W_{N\_Trunc\_real} - \mathrm{factor}_{Trunc} \cdot W_{N\_Trunc\_image} = \Phi$$

Computing the worst possible case, (IV.2)*1 + (IV.2)*1:

$$\Phi - \text{3'b100} < \mathrm{Real}_{Precise} < \Phi + \text{3'b100} \qquad \text{(IV.3)}$$

The voter compares the precise value to the expected upper and lower bounds and decides to choose either the precise value or the average (truncated) value.
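The voter decision described above can be sketched in Python. This is a behavioural model with several assumptions that the text does not fix: values are treated as plain integers, the truncated value is the precise value shifted right by 18 bits (32 minus 14), the margin is given in truncated LSBs (4, matching the 3'b100 of Equation IV.3), and a mismatch between the two truncated copies falls back to the precise value. The function name `rpr_vote` and its parameters are hypothetical.

```python
def rpr_vote(precise, trunc_a, trunc_b, margin=4, shift=18):
    """Choose between the precise value and the reduced-precision value.

    precise: full-width output of the precise module.
    trunc_a, trunc_b: outputs of the two identical reduced-precision modules.
    margin: allowed deviation in reduced-precision LSBs.
    shift: number of bits dropped by the truncation.
    """
    if trunc_a != trunc_b:
        # Assumption: the source does not describe this case; the sketch
        # trusts the precise module when the duplicates disagree.
        return precise
    lower = (trunc_a - margin) << shift
    upper = (trunc_a + margin) << shift
    if lower <= precise <= upper:
        return precise            # precise value is plausible, keep full precision
    return trunc_a << shift       # precise value is out of bounds, use the truncated one

value = 0x12345678
trunc = value >> 18
healthy = rpr_vote(value, trunc, trunc)
repaired = rpr_vote(value ^ (1 << 30), trunc, trunc)   # upset in a high bit of the precise path
```

An upset in a high-order bit of the precise path is replaced by the truncated value, so the error left in the output is bounded by the truncation error rather than by the size of the upset.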


b. Verifying Results 


In order to verify the correctness of the outputs of the implemented FFT structure, a MATLAB FFT simulation file was created. This file was used to compare the outputs of three different subprograms. The first subprogram was the built-in MATLAB FFT, the second was a clone of the Verilog design in the MATLAB environment, and the third was a MATLAB translator for the implemented FFT results. Since MATLAB usually handles 64-bit numbers, and due to the decision to shift the input by 3 bits in every stage of the FFT, a possible error was identified in the precise calculation of the FFT between MATLAB's calculated outputs and Verilog's translated outputs. These expected errors are tabulated in Table 2.


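Table 2. Expected Errors of Precision Calculation

FFT Outputs | Expected Error

[Table: expected error bound for the output of each FFT stage; only the row label "Output of the 1st Stage" is legible and the values are not recoverable]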
The purpose of the entire design was the testing of the implemented FFT 
in a radiation environment. Prior to this, testing was conducted via simulation of possible 
radiation effects. For this reason, a radiation module (Figure 16) was imported into the 
FFT design that was capable of introducing error bits at the beginning of each stage, prior 


to the butterfly calculations. 
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Figure 16. FFT and Radiation Module 


The radiation module allows great flexibility in the introduction of error bits, giving the ability to decide in which stage, in which module (one of the truncated RPR modules or the precise module), and for how long this “form” of radiation lasts. Errors introduced in the first stage have less impact on the precision of the average RPR than errors introduced in the last stage, as shown in Table 3, again due to the three-bit shifting decision.


Table 3. Errors Expected Due to the Import of Radiation in the Precise Module

Radiation introduced in the precise module | Expected Error

[Table: rows for radiation introduced in the 1st stage, the 2nd stage, and the 3rd stage; the expected error values are not recoverable]


In this section, the implementation efforts of three different modules, 
TMR, RPR with 8/32 degree, and RPR with 14/32 degree were discussed. In the next 


chapter, the results analysis will be covered based on these designs. 


V. RESULTS 


The efforts to implement a TMR version of the FFT in the Virtex II XC2V6000 were fruitless, forcing a revision to the plan. In order to compare the TMR and RPR modules, the same FPGA chip had to be used, so it was essential to use a larger FPGA of the same family. The next larger available Virtex II FPGA, the XC2V8000, was used.


First, the synthesis reports for the RPR and TMR modules were compared, as 
indicated in Table 4, where a significant difference in occupied resources is noted. RPR 
needs 74 percent of the slices that TMR uses, 76 percent of the slice flip flops, 68 percent 
of the four-input LUTs and 50 percent of embedded multipliers that the TMR uses. In 
Figure 17, the differences identified in the synthesis reports are presented. 


Table 4. Synthesis Results: Comparison between RPR and TMR in a Virtex II XC2V8000-5FF1152 FPGA

Virtex II XC2V8000 | TMR module | TMR's occupancy % | RPR module | RPR's occupancy %
# Slices | 11028 | | 8236 |
# Slice Flip Flops | 18525 | 19% | 14067 | 15%
# BRAMs | 6, 13% (column placement not recoverable)

Figure 17. TMR Vs RPR Synthesis Results

Second, Xilinx's XPower Analyzer was used to investigate the power that each of the modules required for operation. When the outcomes from both cases were compared, an interesting conclusion resulted. Although both systems needed the same amount of quiescent power, RPR required 19 percent less dynamic power than TMR. These results are tabulated in Table 5 and depicted in Figure 18.


Table 5. TMR Versus RPR: XPower Analyzer Results

XC2V8000 Power | TMR | RPR
Total Dynamic Power | 0.17408 W | 0.14079 W
Total Power | 0.31193 W | 0.27864 W
Junction Temperature | 28.3 degrees C | 27.9 degrees C

[Figure 18: bar chart comparing Total Quiescent Power, Total Dynamic Power, and Total Power for TMR and RPR]

Figure 18. TMR Vs RPR: XPower Analyzer Comparison
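The two power statements in the text, equal quiescent power and 19 percent less dynamic power, can be checked from the recovered entries of Table 5. The short Python sketch below only repeats that arithmetic; the dictionary layout and the function name `power_summary` are illustrative assumptions, and quiescent power is taken as total power minus dynamic power.

```python
tmr = {"dynamic_w": 0.17408, "total_w": 0.31193}
rpr = {"dynamic_w": 0.14079, "total_w": 0.27864}

def power_summary(tmr, rpr):
    """Return (TMR quiescent W, RPR quiescent W, fractional dynamic power saving of RPR)."""
    quiescent_tmr = tmr["total_w"] - tmr["dynamic_w"]
    quiescent_rpr = rpr["total_w"] - rpr["dynamic_w"]
    dynamic_saving = 1.0 - rpr["dynamic_w"] / tmr["dynamic_w"]
    return quiescent_tmr, quiescent_rpr, dynamic_saving

q_tmr, q_rpr, saving = power_summary(tmr, rpr)
```

Both quiescent values come out at about 0.138 W, and the dynamic saving is about 19 percent, in agreement with the text.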


After inspecting the synthesis and power reports from Xilinx for the two implemented modules, RPR with a degree of 14/32 and TMR, a more detailed conclusion can be made concerning the RPR method; it is discussed in greater detail in Chapter VI.

VI. CONCLUSIONS AND RECOMMENDATIONS


A. SUMMARY 


The objective of this thesis was the creation of an FFT structure that would be implemented in an FPGA, adopting two different methods of redundancy, TMR and RPR. The purpose was to investigate the capabilities of RPR. First, a simple 64-point Radix-4 in-place FFT was implemented that could handle fixed-point, two's complement numbers and was tested with accurate results. Next, a TMR structure was designed by replicating three identical FFT structures and by importing a voter at the end of each stage. The next step was the design of an RPR structure with a degree of 8/32. The design was successfully implemented, but it failed to protect the system against radiation. Taking into consideration the problems from the previous unsuccessful design, a new RPR structure with a degree of 14/32 was constructed, resulting in a design that worked correctly and managed to protect the FFT structure efficiently.


One of the major concerns about the RPR method is the size of the voter. This thesis suggests a simple alteration of the method suggested by Snodgrass [1], in which there is no need for generating upper and lower bounds. Instead, the truncated value of the precise number is formed and duplicated. This alteration simplifies the design and decreases the size of the voter, since the voter can now easily verify the correctness of the truncated values with a bit-to-bit comparison and can output the predicted theoretical boundaries of the precise value with a simple addition or subtraction.


Although several interesting results were obtained during this research relative to the specific structure chosen (the 64-point Radix-4 in-place FFT), some conclusions with broad theoretical impact should also be mentioned. These findings can be summarized by the following statements: for arithmetic operations, the RPR method is sufficient, requires fewer resources, and is more power efficient than TMR. These reductions in resource requirements and power consumption come at the price of a sacrifice in precision.

B. RECOMMENDATIONS FOR FUTURE STUDY
1. Overflow Manipulator 


The current structure of this design did not permit a further decrease in RPR 
degree, but the addition of an intelligent overflow manipulator in each stage would allow 
abandonment of the high RPR degree and permit protection of each stage with the same 
small amount of bits that are desired. This would enable further investigation of the 
impact of RPR degree, and assist in understanding the trade-off of RPR degree, on 


precision, capacity size, and power consumption. 


2. Implementing a 4 Recursive Butterfly FFT Instead of the 12 Butterfly 
FFT 


The basic FFT module used in this thesis was required to handle data in real time. That forced the use of a four-BF structure per stage for the Radix-4 FFT, creating a rather large FFT processor that required a significant number of embedded multipliers (33 percent) and a trivial amount of the remaining resources (less than 13 percent in all cases) for a Virtex II XC2V6000 FPGA implementation. This revealed that the design had a significant weakness: the embedded multipliers. It is interesting to note that in considering the FFT design, a recursive four-BF multiple-stage expansion effort was evaluated. This would have kept the majority of the FFT structure intact, altering only the controller. Of course, a replacement of the embedded multipliers would permit a deviation from the Virtex II family, but even so, the decrease in the number of required BFs, in addition to the decrease in the demanded RAM resources, would permit the development of a less resource-demanding FFT, with RPR protection, at the price of sacrificing the operational sample rate.

APPENDIX A. FFT IMPLEMENTATION


This Appendix includes the 64-point, in-place, Radix-4, DIF FFT design as it is described in Chapter IV.A. The module receives two fixed-point 32-bit inputs (real and imaginary) and outputs two fixed-point 32-bit results. The design was implemented in a Virtex II XC2V6000 BF957 using Xilinx tools. The synthesis report confirmed that the design worked at a clock speed of 225MHz.

A2. FFT MODULE: XILINX BEHAVIORAL AND STRUCTURAL DESIGN


A2.I.

Figure 21. RAM Structure

input CLK,RESET,WE_FFT;

output reg WE_stage1,WE_stage2,WE_stage3;
output reg WE_Compute_Factor_Stage1=0;
output reg WE_Compute_Factor_Stage2=0;
output reg WE_Compute_Factor_Stage3=0;
output reg WE_Bit_Reversing_Last_Stage=0;
reg [8:0] counter_FFT=0;

always @(posedge CLK)
begin
    WE_stage1=WE_FFT;
    if (RESET)
    begin
        counter_FFT<=0;
        WE_stage1=0;
        WE_stage2=0;
        WE_stage3=0;
    end
    if (WE_stage1==1)
    begin
        if (counter_FFT==85) WE_stage2=1;
        if (counter_FFT==170) WE_stage3=1;
        if (counter_FFT==51) WE_Compute_Factor_Stage1=1;
        if (counter_FFT==112) WE_Compute_Factor_Stage2=1;
        if (counter_FFT==117) WE_Compute_Factor_Stage3=1;
        if (counter_FFT==256) WE_Bit_Reversing_Last_Stage=1;
        counter_FFT<=counter_FFT+1; //< !!!
    end
    // Keep in mind that the controller of each stage needs 2 clocks from the time it is
    // triggered to the time that it can handle the inputs.
end
endmodule


A2.II. Reversing Bit of Last Stage Module


`timescale 1ns / 1ps

module Reversing_Last_Stage(CLK,WE_Bit_Reversing,Address_of_Output_Result);
input CLK;
input WE_Bit_Reversing;
reg [7:0] order_Inp_Memory=0;
output reg [6:0] Address_of_Output_Result=0;

always @(posedge CLK)
begin
    // Bit reversing is needed in every stage. It is convenient to change the input
    // address (reversing the bits) before the data are transferred into the RAM of the next stage.
    // The 6-bit address is reversed in Radix-4 (base-4) digits,
    // for example: 300 -> 003, 012 -> 210, etc.
    if (WE_Bit_Reversing)
    begin
        if (order_Inp_Memory==64)
        begin
            order_Inp_Memory=0;
        end
        Address_of_Output_Result={order_Inp_Memory[6],order_Inp_Memory[1],order_Inp_Memory[0],
            order_Inp_Memory[3],order_Inp_Memory[2],order_Inp_Memory[5],order_Inp_Memory[4]};
        order_Inp_Memory=order_Inp_Memory+1'b1;
    end
end
endmodule


A2.III. Controller of First Stage


`timescale 1ns / 1ps

module Controller_of_BF_stage1(CLK,RESET,WE,WE_BF_for_multiplexer,
    order_Memory1,order_Twiddle9,order_Inp_Memory1);

// Based on: Fast Fourier Transform [64FFT], Univ. of the Ryukyus, Okinawa, Japan.
// From the moment WE_basic_stage1 is on (1), the sampled input signal must be
// imported on the second following CLK, meaning:
//   WE_basic_stage1 is on at CLK0
//   RAM input sampled_signal will be active at CLK2

input CLK,RESET,WE;
output reg [6:0] order_Memory1;     // 7 bits each (due to 64 present)
output reg [5:0] order_Twiddle9;    // 6 bits each (due to 64 present)
output reg [6:0] order_Inp_Memory1; // 7 bits each
output reg WE_BF_for_multiplexer;
reg [6:0] order_Memory;
reg [5:0] order_Twiddle,order_Twiddle1,order_Twiddle2,order_Twiddle3,order_Twiddle4,
    order_Twiddle5,order_Twiddle6,order_Twiddle7,order_Twiddle8;
reg [6:0] order_Inp_Memory;
parameter stage=1;
parameter N=64;
reg [5:0] counter=5'b00000;
reg [7:0] count_for_each_BF=8'b0;
reg [6:0] inpA=7'b0;
// counter2: in case an 18-clock delay is needed; the order for the stage 2 memory
// will probably not be needed.
reg [5:0] counter2=6'b010010;
reg [7:0] changer1=64;
//in stage 1, Controller needs 1 clock in order to give the right output 
always @(posedge CLK)
begin
case (stage)
1:
begin
//order_Memory: generates the output address for the stage1 Memory: 0,16,32,48 - 1,17,33,49
//Keep in mind that order_Memory has to take into consideration in which place
//the data is in RAM: first 64 slots or next 64 slots (changer1's task)
//order_Twiddle: generates the output address for the stage1 Twiddle: 0,0,0 - 0,1,2 - 0,2,4 - ...
//order_Inp_Memory: generates the input address for the stage1 Memory: 0,1,2,3,4.....63 -
//64,65..127
if (WE)
begin
if (inpA==127) inpA<=0;
inpA<=inpA+1;
order_Memory<=count_for_each_BF+counter+changer1;
order_Inp_Memory<=inpA;
if (RESET)
begin
counter<=5'b00000;
count_for_each_BF<=0;
counter2<=13;
inpA<=0;
end
if (((inpA)>=63)&(inpA<127)) changer1<=0; // changer1 is the pointer that selects between the two 64-slot halves of the RAM
else changer1<=64;
if (count_for_each_BF==0)
begin
order_Twiddle<=32'b01000000000000000000000000000000;
WE_BF_for_multiplexer<=1;
end
if (count_for_each_BF==16)
begin
order_Twiddle<=counter;
WE_BF_for_multiplexer<=0;
end
if (count_for_each_BF==32)
begin
order_Twiddle<=counter<<1;//b2<=counter*2;
WE_BF_for_multiplexer<=1;
end
if (count_for_each_BF==48)
begin
order_Twiddle<=(counter<<1)+(counter);//b3<=counter*3;
WE BF for multiplexer a 
end 
count_for_each_BF<=count_for_each_BF+16;
if (count_for_each_BF==48)
begin
count_for_each_BF<=0;
counter<=counter+1;
counter2<=counter2+1;
if (counter==15)
begin
counter<=0;
end
// the following if statement is for the order_Stage2_Memory
if (counter2==15)
begin
counter2<=0;
end
end
end
end
endcase 


end 


//One stage registers for the commands 

always @(posedge CLK) 

begin 

order_Inp_Memory1=order_Inp_Memory;
order_Memory1=order_Memory;
order_Twiddle1=order_Twiddle;
end

// The Wn_real and Wn_im bypass the "Compute Factor" module, so the Wn has to be
// delayed by the same number of clocks as if it had been passed through
// "Compute Factor". To avoid a confusing parameter in the Controller and to avoid
// pipelined registers for the Wn in the RAM, the delay is applied to order_Twiddle.


//Registers for order_Twiddle
always @(posedge CLK)
begin
order_Twiddle2<=order_Twiddle1;
end
always @(posedge CLK)
begin
order_Twiddle3<=order_Twiddle2;
end
always @(posedge CLK)
begin
order_Twiddle4<=order_Twiddle3;
end
always @(posedge CLK)
begin
order_Twiddle5<=order_Twiddle4;
end
always @(posedge CLK)
begin
order_Twiddle6<=order_Twiddle5;
end
always @(posedge CLK)
begin
order_Twiddle7<=order_Twiddle6;
end
always @(posedge CLK)
begin
order_Twiddle8<=order_Twiddle7;
end
always @(posedge CLK)
begin
order_Twiddle9<=order_Twiddle8;
end
endmodule 
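
The chain of registers that delays the twiddle address can also be written as one parameterized delay line. The sketch below is an illustrative alternative under stated assumptions, not the form used in the thesis: the module name `delay_line` and the parameters `WIDTH` and `DEPTH` are hypothetical, and the default of eight stages simply mirrors the eight registers order_Twiddle2 to order_Twiddle9 above.

```verilog
// Hypothetical generic delay line (illustrative, not the thesis code).
module delay_line #(
  parameter WIDTH = 6,  // word width; 6 bits matches order_Twiddle
  parameter DEPTH = 8   // number of register stages, must be >= 1
)(
  input  wire             clk,
  input  wire [WIDTH-1:0] d,
  output wire [WIDTH-1:0] q
);
  reg [WIDTH-1:0] pipe [0:DEPTH-1];
  integer i;

  always @(posedge clk) begin
    pipe[0] <= d;                       // first stage samples the input
    for (i = 1; i < DEPTH; i = i + 1)
      pipe[i] <= pipe[i-1];             // each stage copies its predecessor
  end

  assign q = pipe[DEPTH-1];             // output appears DEPTH clocks after d
endmodule
```

With this form the latency becomes a single parameter, which may be convenient if the pipeline depth of the multiplier path changes.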


A2.IV. RAM


`timescale 1ns / 1ps

module Ram_edition1_stage1(
order_Memory,
order_Twiddle,
Ain_real,Ain_im,
Win_real1,Win_im1,
Ram_input_sampled_signal_real0,
Ram_input_sampled_signal_im0,
addr_Inp,
WE,
CLK
);


// Includes both Twiddle ROM and Memory RAM for each stage
// This edition has 1 real and 1 imaginary part, does not include the "twiddle ROM"
// and actually has 2 128N RAMs constructed by Coregen (based exactly on the Xilinx
// library RAMB16_S36_S36)
input [31:0] Ram_input_sampled_signal_real0,Ram_input_sampled_signal_im0;
input [6:0] order_Memory;// gives the address of the output that we need
input [5:0] order_Twiddle;// gives the address of the twiddle that we need
input [6:0] addr_Inp;// gives the input address at which the values are stored
input CLK;
output [31:0] Ain_real,Ain_im;
output [31:0] Win_real1,Win_im1;
wire [31:0] twiddle_real [63:0];
wire [31:0] twiddle_im [63:0];
wire [2047:0] twiddle_real_A,twiddle_im_A;
input WE;
Ram_Schemat_stage1_ed2
AA2(CLK,addr_Inp,Ram_input_sampled_signal_im0,Ram_input_sampled_signal_real0,
order_Memory,WE,Ain_im,Ain_real);
////////////////////////////////////////

//the following are the 64 32-bit twiddle numbers computed by MATLAB


assign twiddle_real_A=2048'b01000000000000000000000000000000001111111011000100011
011010001110011111011000101001011111001111100111101001111101000001010101 
101001110110010000011010111100111100011100001110001011001011110001100110 
101001101101100110001010010001100010111100100000000110101100010110101000 
001001111001100110000101000100110011110011001001010001000111000111001110 
11001110011000111100010101101011101001 110000001 1000011111011110001010100 
1100001001010010100000001 10001011100000110001111100010111000001111000000 
110010001011110100110101111000000000000000000000000000000001111100110111 
010000101100101000011110011100000111010001111100001111011010110101111111 
001110100011110011110000010000111010101100111100001110101001010001011000 
111110111000111000110001001100011001101011101100110000110011011010111010 
010101111101100001100110011110011101000011011111111001010011100101011001 
001001100111010110111000111100011101001101000011100110001001101111100101 
00001100001 1100001011000001011111010101001011000001001110101101000001 100 
000110000000100111011100100101 110001 10000000000000000000000000000001 1000 
00001001110111001001011100011000001001110101101000001 1000001 100001011000 
001011111010101001011000100110111110010100001100001110001111000111010011 
010000111001100101011001001001100111010110111001110100001101111111100101 




OOLLIOLOOLOLOLLILLILOLLOOOOLLOOLLOOLLILLOLOLLLOLLOOLLOOOOLLOOLLOLLOLOLLLOLLI 
10001110001 1000100110001 10011100001 11010100101000101 10001 111110011110000 
OLOOOOLLLIOLOLOLLOOLILLIOLLOLOLLOLOLL111110011101000111110011 1000001110100 
OLILILOOOOLLILIIIOOLLOLLLOLOOOOLOLLOOLOLOOOOLLLLLLIIIIIIIIIIIIIIIIIIIIIT 
111000001 100100010111101001 1010111100001 100011111000101 1100000111 1000010 
01010010100000001 1000101 1100001 10000111110111100010101001 100001111000101 
01101011101001 110000010001 110001110011101100111001 1001010001001 100111100 
110010010100010110101000001001 1110011001 100001 100010111 1001000000001 1010 
110001 10101001 1011011001 10001010010001110000111000101 1001011110001 100111 
011001000001 101011110011110001111010011111010000010101011010011111011000 
LOLOOLOLIIIIOOLLILITLOOLLL111101 100010001 LOL LOLOOO1 11; 


assign twiddle_im_A=2048'b00000000000000000000000000000000000001100100010111101001101011110
00011000111110001011100000111100001001010010100000001 1000101 110000110000 
111110111100010101001100001111000101011010111010011100000100011100011100 
111011001110011001010001001100111100110010010100010110101000001001111001 
100110000110001011110010000000011010110001101010011011011001100010100100 
011100001110001011001011110001100111011001000001101011110011110001111010 
011111010000010101011010011111011000101001011111001111100111111101100010 
0011011010001 11010000000000000000000000000000000011111110110001000110110 
100011100111110110001010010111110011111001111010011111010000010101011010 
011101100100000110101111001111000111000011100010110010111100011001101010 
011011011001100010100100011000101111001000000001101011000101101010000010 
011110011001100001010001001100111100110010010100010001110001110011101100 
111001100011110001010110101110100111000000110000111110111100010101001100 
001001010010100000001 1000101 11000001 100011111000101110000011110000001 100 
100010111101001101011110000000000000000000000000000000011111001101110100 
001011001010000111100111000001110100011111000011110110101101011111110011 
101000111100111100000100001110101011001111000011101010010100010110001111 
101110001110001100010011000110011010111011001100001100110110101110100101 
011111011000011001100111100111010000110111111110010100111001010110010010 
011001110101101110001111000111010011010000111001100010011011111001010000 
11000011100001011000001011111010101001011000001001110101101000001 1000001 
10000000100111011100100101110001 10000000000000000000000000000001 10000000 
1001110111001001011100011000001001110101101000001 1000001 10000101 10000010 
111110101010010110001001101111100101000011000011100011110001110100110100 
001110011001010110010010011001110101101110011101000011011111111001010011 
101001010111110110000110011001111010111011001100001100110110101110111000 
111000110001001100011001110000111010100101000101100011111100111100000100 
001110101011001111011010110101111111001110100011111001110000011101000111 
110000111111001101110100001011001010000; 


generate
genvar e;
for (e=1;e<65;e=e+1)
begin: L1
assign twiddle_real[64-e][31:0]=twiddle_real_A[(e*32-1):((e-1)*32)];
assign twiddle_im[64-e][31:0]=twiddle_im_A[(e*32-1):((e-1)*32)];
end
endgenerate
assign Win_real1=twiddle_real[{1'b0,order_Twiddle[5:0]}];//here you have to put the twiddle data
assign Win_im1=twiddle_im[{1'b0,order_Twiddle[5:0]}];
endmodule
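
The leading entries of the two constants above appear consistent with cos(2*pi*k/64) and sin(2*pi*k/64) scaled by 2^30 and truncated (the first real entry is 0x40000000 and the second is 0x3FB11B47). Under that assumption, a table of this kind could be regenerated in simulation with the sketch below. It is not the MATLAB script used for the thesis: the module name `twiddle_table_sketch` is hypothetical, the 2^30 scale is inferred from the first entries only, and `$cos`, `$sin` and `$rtoi` require a simulator that supports the Verilog-2005 real math functions.

```verilog
// Simulation-only sketch (hypothetical): prints 64 twiddle factors in a
// fixed-point format with 30 fractional bits. $rtoi truncates toward zero.
module twiddle_table_sketch;
  localparam real PI    = 3.14159265358979323846;
  localparam real SCALE = 1073741824.0;   // 2^30, inferred from 0x40000000

  integer k;
  integer re, im;
  real    angle;

  initial begin
    for (k = 0; k < 64; k = k + 1) begin
      angle = 2.0 * PI * k / 64.0;
      re = $rtoi($cos(angle) * SCALE);
      im = $rtoi($sin(angle) * SCALE);
      $display("k=%0d real=%h imag=%h", k, re, im);
    end
    $finish;
  end
endmodule
```

The printed words can be compared against the slices of twiddle_real_A and twiddle_im_A before a regenerated table is trusted, since the ordering and rounding of the original entries are not confirmed here.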


A2.V. Multiplexer and Voter of Each Stage 


`timescale 1ns / 1ps

module Multiplexer_A(clk,Input_realA,Input_imageA,Input_realB,Input_imageB,
WE_BF_for_multiplexer,Out_real,Out_im);
input [31:0] Input_realA,Input_imageA,Input_realB,Input_imageB;
input clk,WE_BF_for_multiplexer;
output reg [31:0] Out_real;
output reg [31:0] Out_im;
always @(posedge clk)
begin
if (WE_BF_for_multiplexer)
begin
Out_real <=Input_realA;
Out_im <=Input_imageA;
end
else
begin
Out_real <=Input_realB;
Out_im <=Input_imageB;
end
end
endmodule

A2.VI. Compute Factor 


`timescale 1ns/1ps

//11-12-09 32bit input, N=64, Radix4
module Compute_Factor(clk,WE_BF,Input_real,Input_image,factorA,factorB,Aout_real,
Aout_im,WE_compute_factor);
input clk,WE_compute_factor;
input [31:0] Input_real,Input_image;
output reg [31:0] Aout_real,Aout_im;
output reg [31:0] factorA,factorB;
output reg WE_BF;//gives the order to the BF multiplier to start the procedure.
reg [31:0] factor1_a,factor2_a,factor3_a,factor4_a,factor5_a,factor6_a;
reg [31:0] factor1_b,factor2_b,factor3_b,factor4_b,factor5_b,factor6_b;
reg [1:0] counter=0;
reg [1:0] uncounter=0;
reg [31:0] Ain_real,Ain_im,Bin_real,Bin_im,Cin_real,Cin_im,Din_real,Din_im;
reg stage1=0;
reg stage2=0;
reg [31:0] Areal_a,Aim_a;
reg [31:0] Areal_b,Aim_b;
reg [31:0] t1,t2,t3,t4,t5,t6,t7,t8;
reg [31:0] Aoutreal_reg_0,Aoutim_reg_0;
reg [31:0] Aoutreal_reg_1,Aoutim_reg_1;
reg [31:0] Aoutreal_reg_2,Aoutim_reg_2;
reg [31:0] Aoutreal_reg_3,Aoutim_reg_3;
reg [31:0] Aoutreal_reg_4,Aoutim_reg_4;
reg [31:0] Aoutreal_reg_5,Aoutim_reg_5;
reg [31:0] Aoutreal_reg_6,Aoutim_reg_6;
reg [31:0] Aoutreal_reg_7,Aoutim_reg_7;
reg [31:0] Aoutreal_reg_8,Aoutim_reg_8;
reg [31:0] Aoutreal_reg_9,Aoutim_reg_9;
reg [31:0] Aoutreal_reg_10,Aoutim_reg_10;
reg [31:0] Aoutreal_reg_11,Aoutim_reg_11;
reg [31:0] Aoutreal_reg_12,Aoutim_reg_12;
reg [31:0] Real [3:0];
reg [31:0] Im [3:0];
always @(posedge clk)
begin
//The 'counter' pairs the 4 inputs and gives the order to 'stage1' to start
//the calculation of the needed factors
if (WE_compute_factor)
begin
if (counter==2'b00)
begin
Ain_real<=Input_real;
Ain_im<=Input_image;
stage1<=0;
end
else if (counter==2'b01)
begin
Bin_real<=Input_real;
Bin_im<=Input_image;
end
else if (counter==2'b10)
begin
Cin_real<=Input_real;
Cin_im<=Input_image;
end
else if (counter==2'b11)
begin
Din_real<=Input_real;
Din_im<=Input_image;
stage1<=1;
end
counter=counter+1;
end
end
always @(posedge clk) // We compute all the factors here and also the output
// that does not need a multiplier (Areal,Aim).
// In every 4 pairs, due to Radix-4, one pair does not need a multiplier
begin
if (stage1)
begin
factor1_a <= Ain_real+Bin_im;
factor1_b <= Cin_real+Din_im;
factor2_a <= Ain_real+Cin_real;
factor2_b <= Bin_real+Din_real;
factor3_a <= Ain_real+Din_im;
factor3_b <= Cin_real+Bin_im;
factor4_a <= Ain_im+Din_real;
factor4_b <= Cin_im+Bin_real;
factor5_a <= Ain_im+Cin_im;
factor5_b <= Bin_im+Din_im;
factor6_a <= Ain_im+Bin_real;
factor6_b <= Cin_im+Din_real;
Areal_a <=Ain_real+Bin_real;
Areal_b <=Cin_real+Din_real;
Aim_a <=Ain_im+Bin_im;
Aim_b <=Cin_im+Din_im;
stage2<=1;
end
end


always @(posedge clk)
begin
t1<=Areal_a+Areal_b;
t2<=Aim_a+Aim_b;
t3<=factor1_a-factor1_b;
t4<=factor4_a-factor4_b;
t5<=factor2_a-factor2_b;
t6<=factor5_a-factor5_b;
t7<=factor3_a-factor3_b;
t8<=factor6_a-factor6_b;
end


always @(posedge clk)
begin
WE_BF<=0;
if (stage2)
begin
if (uncounter==2'b01)
begin
Aoutreal_reg_0 <=t1;
Aoutim_reg_0 <=t2;
WE_BF<=1;
end
else if (uncounter==2'b10)
begin
factorA <=t3;
factorB<=t4;
WE_BF<=1;
end
else if (uncounter==2'b11)
begin
factorA <=t5;
factorB<=t6;
WE_BF<=1;
end
else if (uncounter==2'b00)
begin
factorA <=t7;
factorB<=t8;
WE_BF<=1;
end
uncounter=uncounter+1;
end
end


//my BF_multiplier1 needs 10 cycles in order to send the input to the output,
// that's why I need to keep my Aim, Areal for 11 cycles.
//register1 for Areal,Aim
always @(posedge clk)
begin
Aoutreal_reg_1 <= Aoutreal_reg_0;
Aoutim_reg_1 <= Aoutim_reg_0;
end
always @(posedge clk)
begin
Aoutreal_reg_2 <= Aoutreal_reg_1;
Aoutim_reg_2 <= Aoutim_reg_1;
end
always @(posedge clk)
begin
Aoutreal_reg_3 <= Aoutreal_reg_2;
Aoutim_reg_3 <= Aoutim_reg_2;
end
always @(posedge clk)
begin
Aoutreal_reg_4 <= Aoutreal_reg_3;
Aoutim_reg_4 <= Aoutim_reg_3;
end
always @(posedge clk)
begin
Aoutreal_reg_5 <= Aoutreal_reg_4;
Aoutim_reg_5 <= Aoutim_reg_4;
end
always @(posedge clk)
begin
Aoutreal_reg_6 <= Aoutreal_reg_5;
Aoutim_reg_6 <= Aoutim_reg_5;
end
always @(posedge clk)
begin
Aoutreal_reg_7 <= Aoutreal_reg_6;
Aoutim_reg_7 <= Aoutim_reg_6;
end
always @(posedge clk)
begin
Aoutreal_reg_8 <= Aoutreal_reg_7;
Aoutim_reg_8 <= Aoutim_reg_7;
end
always @(posedge clk)
begin
Aoutreal_reg_9 <= Aoutreal_reg_8;
Aoutim_reg_9 <= Aoutim_reg_8;
end
always @(posedge clk)
begin
Aoutreal_reg_10 <= Aoutreal_reg_9;
Aoutim_reg_10 <= Aoutim_reg_9;
end
always @(posedge clk)
begin
Aoutreal_reg_11 <= Aoutreal_reg_10;
Aoutim_reg_11 <= Aoutim_reg_10;
end
always @(posedge clk)
begin
Aout_real <= Aoutreal_reg_11;
Aout_im <= Aoutim_reg_11;
end
endmodule 
A2.VII. BF’s Multiplier 


`timescale 1ns / 1ps

//02-01-10 32bit input, N=64, Radix4
module BF_multiplier1
(clk,reset,clk_enable,Wn_real1,Wn_im1,factorA,factorB,out1_real,out1_im);
input [31:0] Wn_real1,Wn_im1;
input [31:0] factorA,factorB;
input clk,clk_enable,reset;
// clk_enable = WE_BF from Compute Factor.
// Only if it is enabled are the multiplier functions going to work.
// Otherwise, one in every 4 signals is going to run through the pipeline
// registers instead of the multipliers.
output reg [31:0] out1_real;
output reg [31:0] out1_im;
wire [31:0] out1_real0;
wire [31:0] out1_im0;
reg [31:0] Wn_real0;
reg [31:0] Wn_im0;
reg [31:0] factorA1;
reg [31:0] factorB1;
wire [31:0] Sum1a;
wire [31:0] Sum1b;
wire [31:0] Sum2a;
wire [31:0] Sum2b;



//using the traditional way as described in N. Gkikas' thesis
// Each time it computes only one pair of outputs,
// so every time it uses the 4 multipliers of Virtex2.
// In every clk it accepts a new input for computing 1 pair of the FFT stage.
//11-12-09 32bit input, N=64, Radix4

//register 1
always @(posedge clk)
begin
if (clk_enable)
begin
Wn_real0 <=Wn_real1;
Wn_im0 <=Wn_im1;
factorA1 <=factorA;
factorB1 <=factorB;
end
end

// Based on the following procedure
// for (n=0;n<3;n=n+1)
// begin
// msb_keeper_real[63:0]= factor[1+n][31:0]*Wn_real[n][31:0] - factor[4+n][31:0]*Wn_im[n][31:0];
// msb_keeper_im[63:0]= factor[4+n][31:0]*Wn_real[n][31:0] + factor[1+n][31:0]*Wn_im[n][31:0];
// out1_real=msb_keeper_real[63:32];
// out1_im =msb_keeper_im[63:32];
// end


//mult_32x32_edit1(clk,reset,clk_enable,Ainput,Binput,Sumfinal)

mult_32x32_edit2 i1(clk,reset,clk_enable,factorA1,Wn_real0,Sum1a);
mult_32x32_edit2 i2(clk,reset,clk_enable,factorB1,Wn_im0,Sum1b);
mult_32x32_edit2 i3(clk,reset,clk_enable,factorB1,Wn_real0,Sum2a);
mult_32x32_edit2 i4(clk,reset,clk_enable,factorA1,Wn_im0,Sum2b);
assign out1_real0=Sum1a+Sum1b;
assign out1_im0 =Sum2a-Sum2b;

//register 3
always @(posedge clk)
begin
if (clk_enable)
begin
out1_real<=out1_real0;
out1_im <=out1_im0;
end
end

endmodule


A2.VIII. Multiplier 32x32 Module


`timescale 1ns/100ps

// This program is a 32x32-bit multiplier that uses the MULT18X18S multipliers of
// Virtex2 and sends the 3 [..]bit results to the CSA. The CSA gives us 2 outputs
// that we send afterwards to the modified CLA in order to get the final result.
// We concatenate the final result in order to have a 32-bit output. The results were
// checked and are correct.

// multiplier Edition 3
module mult_32x32_edit2(clk,reset,clk_enable,Ainput,Binput,Sumfinal);
input [31:0] Ainput,Binput;
input clk,clk_enable,reset;
wire [63:0] Sumfinal_not_ready;
output reg [31:0] Sumfinal;//reg
wire Areg_sign,Breg_sign;
reg [35:0] Ain,Bin;
wire [35:0] Ain0,Bin0;
reg [17:0] Ain1=18'b0;
reg [17:0] Ain2=18'b0;
reg [17:0] Bin1=18'b0;
reg [17:0] Bin2=18'b0;
reg [17:0] Ain1_part2,Ain2_part2,Bin1_part2,Bin2_part2;
wire [16:0] C1,C1a;
wire [10:0] C0;
wire [63:0] Addition1,Addition2,Addition3;
wire [63:0] Sum1,Sum2,Sum3;
reg [63:0] Addition1a,Addition2a,Addition3a;
wire [35:0] telos1,telos2,telos3,telos4;
reg [35:0] a_telos3,a_telos4;
assign C0=11'b00000000000;//10


assign
Ain0=Ainput[31]?{3'b111,Ainput[31:17],1'b0,Ainput[16:0]}:{4'b0000,Ainput[30:17],1'b0,Ainput[16:0]};
assign
Bin0=Binput[31]?{3'b111,Binput[31:17],1'b0,Binput[16:0]}:{4'b0000,Binput[30:17],1'b0,Binput[16:0]};
always @(posedge clk)
begin
Ain1<={1'b0,Ain0[16:0]};
Bin1<={1'b0,Bin0[16:0]};
Ain2<=Ain0[35:18];
Bin2<=Bin0[35:18];
end
MULT18X18S A0B1 (.A(Ain1),
.B(Bin2),
.C(clk),
.CE(clk_enable),
.R(reset),
.P(telos3));//anti 71
MULT18X18S A1B0 (.A(Ain2),
.B(Bin1),
.C(clk),
.CE(clk_enable),
.R(reset),
.P(telos4));

always @(posedge clk)
begin
a_telos3<=telos3[35:0];
a_telos4<=telos4[35:0];
Ain1_part2<=Ain1;
Bin1_part2<=Bin1;
Ain2_part2<=Ain2;
Bin2_part2<=Bin2;
end
MULT18X18S A0B0 (.A(Ain1_part2),
.B(Bin1_part2),
.C(clk),
.CE(clk_enable),
.R(reset),
.P(telos1));
MULT18X18S A1B1 (.A(Ain2_part2),
.B(Bin2_part2),
.C(clk),
.CE(clk_enable),
.R(reset),
.P(telos2));//anti 71

assign C1=a_telos3[35]?17'b11111111111111111:17'b00000000000000000;
assign C1a=a_telos4[35]?17'b11111111111111111:17'b00000000000000000;
assign Addition1={telos2,telos1[33:6]};
assign Addition2={C1,a_telos3,C0};
assign Addition3={C1a,a_telos4,C0};

always @(posedge clk)
begin
Addition1a<=Addition1;
Addition2a<=Addition2;
Addition3a<=Addition3;
end

csa_for_multiplier_beh_pip32x32_a
aat(Addition1a,Addition2a,Addition3a,Sum1,Sum2,clk);
//there is a register in the CSA, that is why I do not use one here.

multiplier_behav_pip_a aaw (Sum1,Sum2,Sum3,clk);
//there is a register in the CLAH, that is why I do not use one here.

always @(posedge clk)
begin
Sumfinal<=Sum3[55:24];//final_not_ready[55:24];
end

endmodule

A2.IX. BF Multiplier—CSA module 


`timescale 1ns / 1ps

//CSA. We create a 64-bit CSA that has 3 inputs and 2 outputs.
//We collect the inputs from the 4 multipliers of Virtex2 (MULT18X18S).
//We send the two outputs to the carry lookahead adder of the
//multiplier_behav_pip module.

module csa_for_multiplier_beh_pip32x32_a(Ain,Bin,Cin,Output1,Output2,CLK2);
input [63:0] Ain,Bin,Cin;
output reg [63:0] Output1,Output2;
input CLK2;
reg [63:0] Cout,Sum;
integer i;
always @(posedge CLK2)
begin
// Sum[1]=Ain[1]+Bin[1];
// Cout[1]=1'b0;
for (i=0;i<64;i=i+1)
begin :L1
Cout[i]<=(Ain[i]&Bin[i])|(Ain[i]&Cin[i])|(Bin[i]&Cin[i]);
Sum[i]<= Ain[i]^Bin[i]^Cin[i];
end
end

always @(Sum)
begin
Output2[63:0]<=Sum[63:0];
Output1[63:0]<={Cout[62:0],1'b0};
end

endmodule
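
The property the CSA module relies on is that the bitwise sum word plus the carry word shifted left by one equals the sum of the three inputs (modulo 2^64). The sketch below checks that identity in simulation with a purely combinational 3:2 compressor. It is an illustration under stated assumptions, not the registered module of the thesis: the names `csa_3to2` and `csa_3to2_tb` are hypothetical, and the random test values come from `$random`.

```verilog
// Hypothetical combinational 3:2 carry-save adder (illustrative only).
module csa_3to2 #(parameter W = 64)(
  input  wire [W-1:0] a, b, c,
  output wire [W-1:0] sum,
  output wire [W-1:0] carry
);
  wire [W-1:0] maj = (a & b) | (a & c) | (b & c); // per-bit carry out
  assign sum   = a ^ b ^ c;                       // per-bit sum
  assign carry = {maj[W-2:0], 1'b0};              // carries move one bit up
endmodule

// Self-checking testbench: sum + carry must equal a + b + c modulo 2^64.
module csa_3to2_tb;
  reg  [63:0] a, b, c;
  wire [63:0] s, cy;
  integer n;

  csa_3to2 #(.W(64)) dut(.a(a), .b(b), .c(c), .sum(s), .carry(cy));

  initial begin
    for (n = 0; n < 100; n = n + 1) begin
      a = {$random, $random};
      b = {$random, $random};
      c = {$random, $random};
      #1;
      if (s + cy !== a + b + c) $display("MISMATCH at n=%0d", n);
    end
    $display("csa check finished");
    $finish;
  end
endmodule
```

The final addition of the two output words is what the carry lookahead adder of the next section performs in hardware.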


A2.X. BF Multiplier—CLAH module 


`timescale 1ns / 1ps

module multiplier_behav_pip_a(Ain,Bin,Summary,CLK1);
input [63:0] Ain,Bin;
output [63:0] Summary;
input CLK1;
wire C0;
wire [15:0] Gener,Propagate;
reg [3:0] Propagate3stage1,Gener3stage1;
wire [3:0] Propagate3stage,Gener3stage;
wire [11:0] doesntmatter;
wire useless,useless1;
wire [7:0] doesntmatter1;
reg [63:0] Ain1,Ain2,Ain3,Ain4,Bin1,Bin2,Bin3,Bin4;
reg [15:0] Gener1,Gener2,Gener3,Propagate1,Propagate2,Propagate3;
reg [3:0] Cin1;
reg [15:0] Cjfinal;
wire [11:0] Cjstage5;
wire [2:0] Cin4stage;
assign C0=0;

//stage 1
generate
genvar e;
for (e=0;e<16;e=e+1)
begin: L1
//assign A=
//assign B=Bin[3+e*4:0+e*4];
alu1 u1 (Ain[3+e*4:0+e*4],Bin[3+e*4:0+e*4],C0,Gener[e],Propagate[e]);
end
endgenerate

//Register 1
always @(posedge CLK1)
begin
Gener1[15:0]<=Gener[15:0];
Propagate1[15:0]<=Propagate[15:0];
Ain1[63:0]<=Ain[63:0];
Bin1[63:0]<=Bin[63:0];
end

//stage 2
generate
genvar k;
for (k=0;k<4;k=k+1)
begin :L2
CLAH name1(Gener1[3+k*4:0+k*4],Propagate1[3+k*4:0+k*4],C0,Gener3stage[k],
Propagate3stage[k],doesntmatter[2+k*3:0+k*3]);
end
endgenerate

// Register2
always @(posedge CLK1)
begin
Gener3stage1[3:0]<=Gener3stage[3:0];
Propagate3stage1[3:0]<=Propagate3stage[3:0];
Gener2[15:0]<=Gener1[15:0]; //we keep them in order to use them on stage 4
Propagate2[15:0]<=Propagate1[15:0];//we keep them in order to use them on stage 4
//Cj_reg2[3:1]=doesntmatter[2:0];//stores only the Cj from the first CLAH ???
Ain2[63:0]<=Ain1[63:0];
Bin2[63:0]<=Bin1[63:0];
end

//stage 3
CLAH name2(Gener3stage1[3:0],Propagate3stage1[3:0],C0,useless,useless1,
Cin4stage[2:0]);

// Register3
always @(posedge CLK1)
begin
Cin1[3:0]<={Cin4stage[2:0],C0};//input as the Cin to the CLAH of stage 4 - Cin for
// CLAH 1 is 0 (we do not have a Cin at the start)
Gener3[15:0]<=Gener2[15:0]; //we keep them in order to use them on stage 4
Propagate3[15:0]<=Propagate2[15:0];//we keep them in order to use them on stage 4
Ain3[63:0]<=Ain2[63:0];
Bin3[63:0]<=Bin2[63:0];
end

//stage 4
generate
genvar i;
for (i=0;i<4;i=i+1)
begin :L3
CLAH name3 (Gener3[3+i*4:0+i*4],Propagate3[3+i*4:0+i*4],Cin1[i],doesntmatter1[i],
doesntmatter1[i+4],Cjstage5[2+i*3:0+i*3]);
end
endgenerate

// Register 4
always @(posedge CLK1)
begin
Cjfinal[15:0]<={Cjstage5[11:9],Cin1[3],Cjstage5[8:6],Cin1[2],Cjstage5[5:3],Cin1[1],
Cjstage5[2:0],Cin1[0]};
Ain4[63:0]<=Ain3[63:0];
Bin4[63:0]<=Bin3[63:0];
end

//stage 5
generate
genvar p;
for (p=0;p<16;p=p+1)
begin :L4
alu2 u2 (Ain4[3+p*4:0+p*4],Bin4[3+p*4:0+p*4],Cjfinal[p],Summary[3+p*4:0+p*4]);
end
endgenerate

endmodule
////////////////////////////////////////
module alu1 (A,B,C0,Gener,Propagate);
//parameter n=4;
input C0;
input [3:0] A,B;//n-1
output reg Gener;
output reg Propagate;
wire [4:1] g; //n
wire [4:1] prop; //n
and and0 (g[1],A[0],B[0]);
and and1 (g[2],A[1],B[1]);
and and2 (g[3],A[2],B[2]);
and and3 (g[4],A[3],B[3]);
xor xor1(prop[1],A[0],B[0]);
xor xor2(prop[2],A[1],B[1]);
xor xor3(prop[3],A[2],B[2]);
xor xor4(prop[4],A[3],B[3]);
always@(A,B,C0)
begin
Propagate=prop[1]&prop[2]&prop[3]&prop[4];//I did not put the C0

Gener=g[4]|(prop[4]&g[3])|(prop[4]&prop[3]&g[2])|(prop[4]&prop[3]&prop[2]&g[1]);
end
endmodule
////////////////////////////////////////


module alu2 (A,B,C0,Sum);
input C0;
input [3:0] A,B;//n-1
output reg [3:0] Sum;//n-1
reg Cout;
always@(A,B,C0)
begin
{Cout,Sum}= A+B+C0;
end
endmodule

////////////////////////////////////////


module CLAH(G,P,C0,Generatea,Propagate,Cj);
input C0;
input [4:1] G,P;
output [3:1] Cj;
output Generatea,Propagate;
assign Propagate=P[1]&P[2]&P[3]&P[4];
assign Generatea=G[1]&P[2]&P[3]&P[4]|G[2]&P[3]&P[4]|G[3]&P[4]|G[4];
assign Cj[1]=(P[1]&C0)|G[1];
assign Cj[2]=G[2]|(P[2]&G[1])|(P[2]&P[1]&C0);
assign Cj[3]=G[3]|(P[3]&G[2])|(P[3]&P[2]&G[1])|(P[3]&P[2]&P[1]&C0);
endmodule
APPENDIX B. TMR IMPLEMENTATION


This appendix includes the TMR version of the 64-point, in-place, Radix-4, DIF
FFT design as it is described in Chapter IV.B.1. All arithmetic calculations are protected
by triple redundancy. The module receives two fixed-point 32-bit inputs (real and
imaginary) and outputs two fixed-point 32-bit results. In the last step of each FFT
stage, the three computed values are inspected by a voter, based on the principle of triple
redundancy. The design could not be implemented in a Virtex II XC2V6000 BF957 and
was finally implemented in a Virtex II XC2V8000 FF1152 using Xilinx tools. The
synthesis report from the XC2V6000 confirmed that the design needed 85 cycles per
stage, a total of 255 cycles of latency, at a clock speed of 134 MHz.


In this appendix, only the schematic and the structural or behavioral modules that
were changed or added after the creation of the basic FFT design (Appendix A) are
illustrated.
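
The voting principle mentioned above can be illustrated with a generic bitwise 2-of-3 majority function. This is a textbook sketch, not the voter of this design, which compares whole 32-bit words with equality tests; the module name `majority_voter` and the parameter `W` are hypothetical.

```verilog
// Hypothetical bitwise majority voter (illustrative, not the thesis voter).
module majority_voter #(parameter W = 32)(
  input  wire [W-1:0] a, b, c,   // the three redundant results
  output wire [W-1:0] y          // per-bit majority of a, b and c
);
  // Each output bit takes the value held by at least two of the inputs,
  // so an upset confined to one copy is masked.
  assign y = (a & b) | (a & c) | (b & c);
endmodule

// Minimal self-checking testbench (simulation only).
module majority_voter_tb;
  reg  [31:0] a, b, c;
  wire [31:0] y;
  majority_voter #(.W(32)) dut(.a(a), .b(b), .c(c), .y(y));
  initial begin
    a = 32'h1234_5678; b = 32'h1234_5678; c = 32'hFFFF_0000; #1; // c corrupted
    if (y !== 32'h1234_5678) $display("FAIL: single fault not masked");
    $display("voter check finished");
    $finish;
  end
endmodule
```

A bitwise voter also masks faults that hit different bit positions in different copies, which a word-level equality voter does not.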
B1. TMR MODULE—XILINX SCHEMATIC DESIGN
Figure 22. TMR’s Entire Concept Layout.
Figure 23. TMR’s First Stage Layout.
Figure 24. TMR’s BF Module


B2. TMR MODULE—XILINX BEHAVIORAL AND STRUCTURAL DESIGN
B2.I. Multiplexer and Voter of Each Stage


`timescale 1ns / 1ps
module MultiplexerA_TMR(clk,Input1_realA,Input1_imageA,Input1_realB,
Input1_imageB,Input2_realA,Input2_imageA,Input2_realB,Input2_imageB,
Input3_realA,Input3_imageA,Input3_realB,Input3_imageB,
WE_BF_for_multiplexer,Out_real,Out_im);
input [31:0] Input1_realA,Input1_imageA,Input1_realB,Input1_imageB;
input [31:0] Input2_realA,Input2_imageA,Input2_realB,Input2_imageB;
input [31:0] Input3_realA,Input3_imageA,Input3_realB,Input3_imageB;
input clk,WE_BF_for_multiplexer;
output [31:0] Out_real;
output [31:0] Out_im;
reg [31:0] Before_Voter_real [2:0];
reg [31:0] Before_Voter_im [2:0];
wire [31:0] Out0_real,Out0_im;
always @(posedge clk)
begin
if (WE_BF_for_multiplexer)
begin
Before_Voter_real[0] <=Input1_realA;
Before_Voter_im[0] <=Input1_imageA;
Before_Voter_real[1] <=Input2_realA;
Before_Voter_im[1] <=Input2_imageA;
Before_Voter_real[2] <=Input3_realA;
Before_Voter_im[2] <=Input3_imageA;
end
else
begin
Before_Voter_real[0] <=Input1_realB;
Before_Voter_im[0] <=Input1_imageB;
Before_Voter_real[1] <=Input2_realB;
Before_Voter_im[1] <=Input2_imageB;
Before_Voter_real[2] <=Input3_realB;
Before_Voter_im[2] <=Input3_imageB;
end
end

assign Out0_real=(Before_Voter_real[0]==Before_Voter_real[1])?Before_Voter_real[0]:Before_Voter_real[1];
assign Out_real=(Out0_real==Before_Voter_real[2])?Out0_real:Before_Voter_real[0];
assign Out0_im=(Before_Voter_im[0]==Before_Voter_im[1])?Before_Voter_im[0]:Before_Voter_im[1];
assign Out_im=(Out0_im==Before_Voter_im[2])?Out0_im:Before_Voter_im[0];
endmodule
APPENDIX C. RPR IMPLEMENTATION


This appendix includes the RPR version of the 64-point, in-place, Radix-4, DIF
FFT design as it is described in Chapter IV.B.3. All arithmetic calculations of the FFT
are protected by the reduced precision redundancy method. The module receives two
fixed-point 32-bit inputs (real and imaginary) and outputs two fixed-point 32-bit results.
In the last step of each FFT stage, the precise value and the truncated values are
inspected by a voter, based on the principle of reduced precision redundancy. The design
was first implemented in a Virtex II XC2V6000 BF957 and then in a Virtex II
XC2V8000 FF1152 using Xilinx tools. The synthesis report from the XC2V6000
confirmed that the design needed 85 cycles per stage, a total of 255 cycles of latency, at a
clock speed of 163 MHz.


In this appendix, only the schematic and the structural or behavioral modules that were
changed or added after the creation of the TMR design (Appendix B) are illustrated.
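
As a rough illustration of how a precise value can be checked against reduced-precision copies, the sketch below accepts the 32-bit result when its 14 most significant bits fall between a lower and an upper 14-bit estimate, and otherwise falls back to the upper estimate padded with zeros. This decision rule is an assumption made for illustration only; the voter actually used in this design is described in Chapter IV and may differ. The names `rpr_voter_sketch`, `full_value`, `upper_bound`, `lower_bound` and `voted` are hypothetical.

```verilog
// Hypothetical RPR decision logic (assumption-based sketch, not the thesis voter).
module rpr_voter_sketch #(
  parameter FULL = 32,   // width of the precise result
  parameter RED  = 14    // width of the reduced-precision estimates
)(
  input  wire signed [FULL-1:0] full_value,
  input  wire signed [RED-1:0]  upper_bound,
  input  wire signed [RED-1:0]  lower_bound,
  output wire signed [FULL-1:0] voted
);
  // Most significant RED bits of the precise result, treated as signed.
  wire signed [RED-1:0] full_msbs = full_value[FULL-1:FULL-RED];

  // The precise result is trusted only when it lies inside the bounds.
  wire in_range = (full_msbs >= lower_bound) && (full_msbs <= upper_bound);

  // Fallback: a reduced-precision estimate with the low bits cleared.
  assign voted = in_range ? full_value
                          : {upper_bound, {(FULL-RED){1'b0}}};
endmodule
```

The sketch does not handle the case where an upset corrupts one of the bounds themselves (for example lower_bound greater than upper_bound); a complete voter would need a rule for that situation.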
C1. RPR MODULE—XILINX SCHEMATIC DESIGN
Figure 25. RPR—Entire Module Layout
Figure 26. RPR Module—Three Stages Layout
Figure 28. RPR—Butterfly


C2. RPR MODULE—XILINX STRUCTURAL AND BEHAVIORAL DESIGN 
C2.I. Radiation Module 


`timescale 1ns / 1ps
module Radiation(clk,Radiation);
input clk;
output [359:0] Radiation;
//Radiation Guide
// Radiation [359:0]
// RadiationA[119:0] = Radiation [359:240] radiation of first stage
// RadiationB[119:0] = Radiation [239:120] radiation of second stage
// RadiationC[119:0] = Radiation [119:0] radiation of third stage
reg [13:0] RadiationAr_upper=14'b0;
reg [13:0] RadiationBr_upper=14'b0;
reg [13:0] RadiationCr_upper=14'b0;
reg [13:0] RadiationAr_lower=14'b0;
reg [13:0] RadiationBr_lower=14'b0;
reg [13:0] RadiationCr_lower=14'b0;
reg [31:0] RadiationAr_TMR=32'b0;
reg [31:0] RadiationBr_TMR=32'b0;
reg [31:0] RadiationCr_TMR=32'b0;
reg [13:0] RadiationAi_upper=14'b0;
reg [13:0] RadiationBi_upper=14'b0;
reg [13:0] RadiationCi_upper=14'b0;
reg [13:0] RadiationAi_lower=14'b0;
reg [13:0] RadiationBi_lower=14'b0;
reg [13:0] RadiationCi_lower=14'b0;
reg [31:0] RadiationAi_TMR=32'b0;
reg [31:0] RadiationBi_TMR=32'b0;
reg [31:0] RadiationCi_TMR=32'b0;
wire [119:0] RadiationA,RadiationB,RadiationC;
reg [1:0] counter=2'b00;
reg [2:0] random=3'b001;
parameter High=14'b00111000000000;
parameter Low=14'b00000000000001;
reg [6:0] mm=0;
wire [13:0] Value1;
assign Value1=High;
always @(posedge clk)
begin
counter<=2;//counter+1;
random<=random+3'b001;
case(counter)
0:
begin
RadiationAr_upper<=Value1+random;
RadiationBr_upper<=Value1+random;
RadiationCr_upper<=Value1+random;
RadiationAi_upper<=Value1+random;
RadiationBi_upper<=Value1+random;
RadiationCi_upper<=Value1+random;
RadiationAr_lower<=0;
RadiationBr_lower<=0;
RadiationCr_lower<=0;
RadiationAi_lower<=0;
RadiationBi_lower<=0;
RadiationCi_lower<=0;
RadiationAr_TMR<=0;
RadiationBr_TMR<=0;
RadiationCr_TMR<=0;
RadiationAi_TMR<=0;
RadiationBi_TMR<=0;
RadiationCi_TMR<=0;
end
1:
begin
RadiationAr_lower<=Value1+random;
RadiationBr_lower<=Value1+random;
RadiationCr_lower<=Value1+random;
RadiationAi_lower<=Value1+random;
RadiationBi_lower<=Value1+random;
RadiationCi_lower<=Value1+random;
RadiationAr_upper<=0;
RadiationBr_upper<=0;
RadiationCr_upper<=0;
RadiationAi_upper<=0;
RadiationBi_upper<=0;
RadiationCi_upper<=0;
RadiationAr_TMR<=0;
RadiationBr_TMR<=0;
RadiationCr_TMR<=0;
RadiationAi_TMR<=0;
RadiationBi_TMR<=0;
RadiationCi_TMR<=0;
end
2:
begin
mm=mm+1;
if ((mm==4))//||(mm==6)||(mm==7))
begin
RadiationBr_TMR<={(Value1[13:11]+random),Value1[10:0],18'b0};
end
else
begin
RadiationCr_TMR<=0;//{(Value1+random),18'b0};
end
RadiationCr_TMR<=0;//{(Value1+random),18'b0};
RadiationAr_TMR<=0;//{(Value1+random),18'b0};
RadiationAi_TMR<=0;//{(Value1+random),18'b0};
RadiationBi_TMR<=0;//{(Value1+random),18'b0};
RadiationCi_TMR<=0;//{(Value1+random),18'b0};
RadiationAr_upper<=0;
RadiationBr_upper<=0;
RadiationCr_upper<=0;
RadiationAi_upper<=0;
RadiationBi_upper<=0;
RadiationCi_upper<=0;
RadiationAr_lower<=0;
RadiationBr_lower<=0;
RadiationCr_lower<=0;
RadiationAi_lower<=0;
RadiationBi_lower<=0;
RadiationCi_lower<=0;
end
3:
begin
RadiationAr_upper<=Value1+random;
RadiationBr_TMR<={(Value1+random),18'b0};
RadiationCr_lower<=Value1+random;
RadiationAi_upper<=Value1+random;
RadiationBi_TMR<={(Value1+random),18'b0};
RadiationCi_lower<=Value1+random;
RadiationBr_upper<=0;
RadiationCr_upper<=0;
RadiationBi_upper<=0;
RadiationCi_upper<=0;
RadiationAr_lower<=0;
RadiationBr_lower<=0;
RadiationAi_lower<=0;
RadiationBi_lower<=0;
RadiationAr_TMR<=0;
RadiationCr_TMR<=0;
RadiationAi_TMR<=0;
RadiationCi_TMR<=0;
end
endcase
end


assign RadiationA[119:0]={RadiationAr_upper,RadiationAi_upper,RadiationAr_TMR,RadiationAi_TMR,RadiationAr_lower,RadiationAi_lower};

assign RadiationB[119:0]={RadiationBr_upper,RadiationBi_upper,RadiationBr_TMR,RadiationBi_TMR,RadiationBr_lower,RadiationBi_lower};

assign RadiationC[119:0]={RadiationCr_upper,RadiationCi_upper,RadiationCr_TMR,RadiationCi_TMR,RadiationCr_lower,RadiationCi_lower};

assign Radiation[359:0]={RadiationA,RadiationB,RadiationC};

endmodule 
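
The bit layout of the 360-bit radiation bus described in the comments and concatenations of this module can be checked with a small software model. The following Python sketch is not part of the thesis design; the language choice and the names `STAGE_FIELDS`, `pack_stage`, `pack_radiation` and `bit_slice` are assumptions made for illustration. It packs the six fields of each stage in the same MSB-first order as the concatenations and lets the reader confirm which bits a given field occupies.

```python
# Field widths of one 120-bit stage word, MSB first, following the order of
# the concatenation {r_upper, i_upper, r_TMR, i_TMR, r_lower, i_lower}.
# The short field names are hypothetical labels, not identifiers of the design.
STAGE_FIELDS = (
    ("r_upper", 14), ("i_upper", 14),
    ("r_tmr", 32), ("i_tmr", 32),
    ("r_lower", 14), ("i_lower", 14),
)


def pack_stage(fields):
    """Concatenate the six fields of one stage into a 120-bit integer."""
    word = 0
    for name, width in STAGE_FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def pack_radiation(stage_a, stage_b, stage_c):
    """Build the 360-bit word {RadiationA, RadiationB, RadiationC}."""
    return (pack_stage(stage_a) << 240) | (pack_stage(stage_b) << 120) | pack_stage(stage_c)


def bit_slice(word, msb, lsb):
    """Return word[msb:lsb] as an unsigned integer (Verilog-style part select)."""
    return (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)
```

For example, `bit_slice(pack_radiation(a, b, c), 359, 240)` returns the packed first-stage word, which matches the comment `RadiationA[119:0] = Radiation[359:240]` at the top of the declarations.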


C2.II. Multiplexer and Voter of Each Stage


`timescale 1ns / 1ps

module Multiplexer_RPR(clk,Input1_realA,Input1_imageA,Input1_realB,Input1_imageB,Input2_realA,Input2_imageA,Input2_realB,Input2_imageB,Input3_realA,Input3_imageA,Input3_realB,Input3_imageB,WE_BF_for_multiplexer,Out_real,Out_im);

input [13:0] Input1_realA,Input1_imageA,Input1_realB,Input1_imageB;
input [31:0] Input2_realA,Input2_imageA,Input2_realB,Input2_imageB;
input [13:0] Input3_realA,Input3_imageA,Input3_realB,Input3_imageB;
input clk,WE_BF_for_multiplexer;
output [31:0] Out_real;
output [31:0] Out_im;
reg signed [31:0] Before_Voter_real;
reg signed [31:0] Before_Voter_im;
reg signed [13:0] Upper_bond_real,Upper_bond_im,Lower_bond_real,Lower_bond_im;
reg signed [31:0] Median_real,Median_im;
reg signed [13:0] Bottom_real;
reg signed [13:0] Top_real;
reg signed [13:0] Bottom_im;
reg signed [13:0] Top_im;
reg signed [13:0] Precise_real;
reg signed [13:0] Precise_image;

wire [31:0] Out_real0,Out_im0;
wire K1_r,K2_r,K3_r;
wire K4_r,K5_r;
wire Omega_real;
wire K1_i,K2_i,K3_i;
wire K4_i,K5_i;
wire Omega_im;
always @(posedge clk)
begin
if (WE_BF_for_multiplexer)
begin
Upper_bond_real =Input1_realA;
Upper_bond_im =Input1_imageA;
Before_Voter_real =Input2_realA;
Before_Voter_im =Input2_imageA;
Lower_bond_real =Input3_realA;
Lower_bond_im =Input3_imageA;
end
else
begin
Upper_bond_real =Input1_realB;
Upper_bond_im =Input1_imageB;
Before_Voter_real =Input2_realB;
Before_Voter_im =Input2_imageB;
Lower_bond_real =Input3_realB;
Lower_bond_im =Input3_imageB;
end

Median_real={({Upper_bond_real[13],Upper_bond_real}+{Lower_bond_real[13],Lower_bond_real})>>1,18'b0};// means /2
Median_im={({Upper_bond_im[13],Upper_bond_im}+{Lower_bond_im[13],Lower_bond_im})>>1,18'b0};// means /2
Top_real=Upper_bond_real+14'b00000000000100;
Bottom_real=Lower_bond_real+14'b11111111111100;
Top_im=Upper_bond_im+14'b00000000000100;
Bottom_im=Lower_bond_im+14'b11111111111100;
Precise_real=Before_Voter_real[31:18];
Precise_image=Before_Voter_im[31:18];
end


//RPR Voter
assign K1_r=(Top_real>=Bottom_real)?1'b1:1'b0;
assign K2_r=(Precise_real>Top_real)?1'b1:1'b0;
assign K3_r=(Precise_real<Bottom_real)?1'b1:1'b0;
and P1 (K4_r,K1_r,K2_r);
and P2 (K5_r,K1_r,K3_r);
or P3 (Omega_real,K4_r,K5_r);
assign Out_real=(Omega_real)?Median_real:Before_Voter_real;

assign K1_i=(Top_im>=Bottom_im)?1'b1:1'b0;
assign K2_i=(Precise_image>Top_im)?1'b1:1'b0;
assign K3_i=(Precise_image<Bottom_im)?1'b1:1'b0;
and P11 (K4_i,K1_i,K2_i);
and P21 (K5_i,K1_i,K3_i);
or P31 (Omega_im,K4_i,K5_i);
assign Out_im=(Omega_im)?Median_im:Before_Voter_im;

// The following example is a behavioral voter of RPR.
// Because it decreased the speed, I preferred
// the structural design above.
////real voter
//if (Upper_bond_real>Lower_bond_real)
// begin
//   if (Before_Voter_real>Top_real)
//     begin
//       Out_real={Median_real,24'b0};
//     end
//   else if (Before_Voter_real<Bottom_real)
//     begin
//       Out_real={Median_real,24'b0};
//     end
//   else
//     begin
//       Out_real=Before_Voter_real;
//     end
// end
//else
// begin
//   Out_real=Before_Voter_real;
// end
//
////image voter
//if (Upper_bond_im>Lower_bond_im)
// begin
//   if (Before_Voter_im>Top_im)
//     begin
//       Out_im={Median_im,24'b0};
//     end
//   else if (Before_Voter_im<Bottom_im)
//     begin
//       Out_im={Median_im,24'b0};
//     end
//   else
//     begin
//       Out_im=Before_Voter_im;
//     end
// end
//else
// begin
//   Out_im<=Before_Voter_im;
// end
//
//end
//assign Out0_real=(Before_Voter_real[0]==Before_Voter_real[1])?Before_Voter_real[0]:Before_Voter_real[1];
//assign Out_real=(Out0_real==Before_Voter_real[2])?Out0_real:Before_Voter_real[0];
//assign Out0_im=(Before_Voter_im[0]==Before_Voter_im[1])?Before_Voter_im[0]:Before_Voter_im[1];
//assign Out_im=(Out0_im==Before_Voter_im[2])?Out0_im:Before_Voter_im[0];
endmodule
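
The decision made by the structural voter above can be modelled in software to check individual cases by hand. The Python sketch below is an illustration only, not the thesis code: the function names `wrap_signed` and `rpr_vote` are hypothetical, the clocked input multiplexer is left out, and the margin of 4 is read from the constants `14'b00000000000100` and `14'b11111111111100` in the listing. It assumes `upper` and `lower` are signed 14-bit bound values and `full` is the signed 32-bit full-precision value.

```python
def wrap_signed(value, bits):
    """Interpret value modulo 2**bits as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def rpr_vote(upper, lower, full, margin=4):
    """Model of one voter channel (real or imaginary).

    Returns the full-precision value unless it falls outside the widened
    bounds, in which case the midpoint of the two bounds is returned.
    """
    top = wrap_signed(upper + margin, 14)       # Top = Upper_bond + 4
    bottom = wrap_signed(lower - margin, 14)    # Bottom = Lower_bond - 4
    precise = full >> 18                        # upper 14 bits of the 32-bit word
    median = ((upper + lower) >> 1) << 18       # (Upper + Lower) / 2, aligned to 32 bits
    bounds_valid = top >= bottom                # K1
    out_of_bounds = precise > top or precise < bottom   # K2 or K3
    return median if (bounds_valid and out_of_bounds) else full
```

With `upper=100` and `lower=96`, a full-precision value of `98 << 18` passes through unchanged, while `500 << 18` is replaced by the median `98 << 18`. When the bounds themselves are inconsistent (`Top < Bottom`), the model, like the gate network, lets the full-precision value through.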


C2.III. BF - Delayer


`timescale 1ns / 1ps

module Delayer_RPR(In_real,In_im,Out_real,Out_im,clk);
input clk;
input [13:0] In_real,In_im;
output reg [13:0] Out_real,Out_im;

reg [13:0] Real_reg_0,Im_reg_0;
reg [13:0] Real_reg_1,Im_reg_1;
reg [13:0] Real_reg_2,Im_reg_2;
reg [13:0] Real_reg_3,Im_reg_3;
reg [13:0] Real_reg_4,Im_reg_4;
reg [13:0] Real_reg_5,Im_reg_5;
reg [13:0] Real_reg_6,Im_reg_6;

// This delayer is for the BF outputs. Using an 8/32 or 14/32 degree of RPR
// reduces the number of multipliers, CSA and CLAH blocks, and therefore the
// number of pipeline stages needed. The 6 pipeline stages that were cut are
// replaced with 6 registers so that the FFT remains synchronous.
always @(posedge clk)
begin
Real_reg_0 <= In_real;
Im_reg_0 <= In_im;
end

always @(posedge clk)
begin
Real_reg_1 <= Real_reg_0;
Im_reg_1 <= Im_reg_0;
end

always @(posedge clk)
begin
Real_reg_2 <= Real_reg_1;
Im_reg_2 <= Im_reg_1;
end

always @(posedge clk)
begin
Real_reg_3 <= Real_reg_2;
Im_reg_3 <= Im_reg_2;
end

always @(posedge clk)
begin
Real_reg_4 <= Real_reg_3;
Im_reg_4 <= Im_reg_3;
end

always @(posedge clk)
begin
Real_reg_5 <= Real_reg_4;
Im_reg_5 <= Im_reg_4;
end

always @(posedge clk)
begin
Out_real <= Real_reg_5;
Out_im <= Im_reg_5;
end
endmodule
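
The behavior of such a register chain can be reproduced with a few lines of software when checking the alignment of the RPR path against the full-precision path. The Python sketch below is an assumption-laden illustration, not part of the design: the class name `DelayLine` and its `clock` method are hypothetical, and the depth is left as a parameter rather than fixed to the number of registers in the listing.

```python
from collections import deque


class DelayLine:
    """Chain of registers that all load on the same clock edge."""

    def __init__(self, depth, reset_value=0):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # Index 0 is the first register, index -1 is the output register.
        self._regs = deque([reset_value] * depth, maxlen=depth)

    def clock(self, sample):
        """Apply one rising edge and return the output register afterwards."""
        # appendleft on a full bounded deque drops the rightmost entry,
        # which shifts every stored value one register towards the output.
        self._regs.appendleft(sample)
        return self._regs[-1]
```

A sample presented before the first edge of a chain of depth `d` appears at the output after the `d`-th edge; until then the output shows the reset value.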
C2.IV. BF - Multiplier


`timescale 1ns/100ps
// multiplier RPR Edition 1
module mult_14x14_edit1_RPR(clk,reset,clk_enable,Ainput,Binput,Sumfinal);

input [13:0] Ainput,Binput;
input clk,clk_enable,reset;
output reg [13:0] Sumfinal;
reg [17:0] Ain,Bin;
reg [17:0] Ain1,Bin1;
wire [17:0] Ain0,Bin0;

// sign-extend the 14-bit operands to the 18-bit multiplier inputs
assign Ain0=Ainput[13]?{4'b1111,Ainput}:{4'b0000,Ainput};
assign Bin0=Binput[13]?{4'b1111,Binput}:{4'b0000,Binput};
wire [35:0] telos;

always @(posedge clk)
begin
Ain1<=Ain0;
Bin1<=Bin0;
end

MULT18X18S A0B0 (.A(Ain1),
.B(Bin1),
.C(clk),
.CE(clk_enable),
.R(reset),
.P(telos));

always @(posedge clk)
begin
Sumfinal<=telos[25:12];
end

endmodule
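
The arithmetic performed by this module, apart from its pipeline registers, is a signed multiplication followed by the part select `[25:12]`. The Python sketch below models only that arithmetic. It is an illustration under stated assumptions: the function names `to_signed` and `mult_14x14` are hypothetical, and the interpretation of the operands as having 12 fractional bits is inferred from the part select, not stated in the listing.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mult_14x14(a, b):
    """Multiply two signed 14-bit operands and keep product bits [25:12].

    Sign extension to 18 bits does not change the numeric value, so the
    36-bit hardware product equals the plain integer product here.
    """
    a = to_signed(a, 14)
    b = to_signed(b, 14)
    product = a * b
    # Bits [25:12] of the two's-complement product, read back as signed 14-bit.
    return to_signed(product >> 12, 14)
```

If the operands are read with 12 fractional bits, `2048` stands for 0.5 and `mult_14x14(2048, 2048)` gives `1024`, that is 0.25. Results outside the 14-bit range wrap around, and low-order product bits are truncated rather than rounded.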




APPENDIX D. MATLAB FILE 


This appendix contains the MATLAB file created to verify the results from Xilinx's RPR module. The file performs the following functions:

• Generates input signals, converts them into two's complement fixed-point binary numbers, and exports them into files for use in ModelSim.

• Imports the generated output files from ModelSim and converts them into decimal.

• Calculates the FFT in two different ways: with MATLAB's built-in FFT function and with a behavioral simulation in MATLAB based on the implemented Xilinx design. Simultaneously, it loads the data from the Xilinx design that is generated by the ModelSim simulator.

• Generates three different graphical figures based on the three sets of data.

• Calculates the bounds for the expected precise-calculation error and for the radiation-induced errors.

• Shows the output results for every stage of the FFT and compares the three sets of data, enabling the user to inspect the impact of radiation on the whole FFT process.


D1. MAIN FILE
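
Before the listing, the fixed-point export format it relies on can be summarized with a short model. The Python sketch below is an illustration, not the thesis code: the function names `to_fixed_bits` and `from_fixed_bits` are hypothetical, and it assumes the 32-bit word with 2 integer bits and 30 fractional bits described in the comments of the MATLAB file, with truncation of the bits that do not fit (as the repeated-doubling loop in the listing does).

```python
import math

FRACTION_BITS = 30   # fractional bits of the exported word
WORD_BITS = 32       # 2 integer bits + 30 fractional bits


def to_fixed_bits(x):
    """Return the 32-character two's-complement bit string for -1 <= x <= 1."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("input must lie in [-1, 1]")
    # Scaling by a power of two is exact in floating point; floor truncates
    # the bits beyond the 30th fractional position.
    scaled = math.floor(x * (1 << FRACTION_BITS))
    return format(scaled & ((1 << WORD_BITS) - 1), "032b")


def from_fixed_bits(bits):
    """Convert a 32-character bit string back to a decimal value."""
    raw = int(bits, 2)
    if raw >= 1 << (WORD_BITS - 1):
        raw -= 1 << WORD_BITS
    return raw / (1 << FRACTION_BITS)
```

In this format 0.5 becomes `00` followed by `1` and 29 zeros, and -0.5 becomes `11` followed by `1` and 29 zeros, which agrees with the sign handling in the listing (leading bits `11` and the fraction of `1 + x` for negative inputs).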


% It works perfectly for |Input|<2^(-9).
% It works with the later RPR designs due to the 3-bit scale down of
% the inputs at the beginning of each stage.
% Radix 4 fft N=64 point
% The program creates the twiddle factors for N=64
% fixed point two's complement
% The 30 LSBs are the fractional part.
% sample_signal_real and sample_signal_image contain the input equation
% This program computes the FFT Radix-4 N=64 point in shape geometry,
% reversing bits in every stage, and compares the result of the calculations
% between my Matlab FFT design and the Verilog design.
% Athanasios Gavros 15/07/2010
% edition 1.3a
% As an input signal: real part: 'equationA', imaginary part: 'equationB'.
% In the phase comparison between my FFT and Verilog, you will sometimes
% observe a huge difference. But that is not entirely true. Due to small
% numbers (10^(-13) etc.) that should be accepted as equal to zero, any
% fluctuation can cause wrong results. That is why we compare the imaginary
% and real parts of the two outputs, and if the difference is smaller than
% 9.3*(10^-7), Matlab prints the message Phase_Checking='Correct'.
% Beware that the results of Verilog are not bit reversed on the final
% stage. We are reversing the address of the output, but here in Matlab we
% are doing this final bit reversal in order to compare our final results
% with the results of the built-in fft.

clear all; 

close all; 

colordef white; 

format long 

pinakasO=cell (64, 64); 

pinakas=cell (64, 64); 

Binary_to_decimal=cell (1,64); 

clear —c- 

adder=0; 

N=64; % points
F1=[0:N-1]/N;

% Input signal
for n=0:1:63
equationA(n+1)=0.5*cos(2*pi*n/64)+0.3*cos(2*pi*n/9);

equati1one it 1)=0 47 610 (292. y/o) FO wl Sin 01 7 14) 0.3" eos (2751 ny Ss); 
end 


% RANDOM signal 
equationA (48) =0.15; 
dd 


( 
( 
( 
( 
equationaA ( 
equationaA ( 
equationaA ( 
equationaA ( 
equationaA ( 
equationbB ( 
equationB ( 
equationB ( 
equationbB ( 
equation B ( 
equationbB ( 


a 


[e} 


if (n>58) 


equationA (n+1)=0.5; Send 

end 

for counter=1:1:64 
sample_signal_real(counter)= double( equationA(counter));%0.0001; 
sample_signal_image(counter)= double( equationB(counter));%0.0001; 


end 
%% twiddle
for mm = 0:1:63
% radix-4
twiddl (mm+1)=double (exp (-2*pi*j*mm*1/64) ); 


Wn_real (mm+1)=(real (twiddl (mm+1)));%cos (2*pi*mm/64) ; 
Wn_im(mm+1)=(-imag (twiddl (mm+1)));%sin(2*pi*mm/64); 
end 


for i=0:16:48
if (i==0)
Ain_real=sample_signal_real(i+p);
Ain_im= sample_signal_image(i+p);
end
if (i==16)
Bin_real=sample_signal_real(i+p);
Bin_im= sample_signal_image(i+p);
end
if (i==32)
Cin_real=sample_signal_real(i+p);
Cin_im= sample_signal_image(i+p);
end
if (i==48)
Din_real=sample_signal_real(i+p);
Din_im= sample_signal_image(i+p);
end
ert1=ert1+1;
check1(ert1)=i+p;
end

factor1 = Ain_real+Bin_im-Cin_real-Din_im;
factor2 = Ain_real-Bin_real+Cin_real-Din_real;
factor3 = Ain_real-Bin_im-Cin_real+Din_im;
factor4 = Ain_im-Bin_real-Cin_im+Din_real;
factor5 = Ain_im-Bin_im+Cin_im-Din_im;
factor6 = Ain_im+Bin_real-Cin_im-Din_real;
Areal_stage1(0+p) =Ain_real+Bin_real+Cin_real+Din_real;
Aim_stage1(0+p) =Ain_im+Bin_im+Cin_im+Din_im;
Areal_stage1(16+p) = double(factor1*Wn_real(p) + factor4*Wn_im(p));% an error in Wn(p+1)!!! chaos everywhere
Aim_stage1(16+p) = double(factor4*Wn_real(p) - factor1*Wn_im(p));
Areal_stage1(32+p) = double(factor2*Wn_real(2*(p-1)+1) + factor5*Wn_im(2*(p-1)+1));
Aim_stage1(32+p) = double(factor5*Wn_real(2*(p-1)+1) - factor2*Wn_im(2*(p-1)+1));
Areal_stage1(48+p) = double(factor3*Wn_real(3*(p-1)+1) + factor6*Wn_im(3*(p-1)+1));
Aim_stage1(48+p) = double(factor6*Wn_real(3*(p-1)+1) - factor3*Wn_im(3*(p-1)+1));


% bit reversal of the Radix4
% example 003-> 300
ttt=0;
e2=0;
for v2=1:1:4
for p2=0:4:12
for i2=0:16:48
ttt=ttt+1;
k22(ttt)=v2+p2+i2;
end
end
end

for i=1:1:64
Aim_stage1_r(k22(i))=Aim_stage1(i);
Areal_stage1_r(k22(i))=Areal_stage1(i);
end


ttt=0;
e2=0;
for v2=0:16:48
for p2=1:1:4
for i2=0:4:12
ttt=ttt+1;
k22a(ttt)=v2+p2+i2;
end
end
end

for qw=1:1:64
Aim_stage1_rr(k22a(qw))=Aim_stage1((qw));
Areal_stage1_rr(k22a(qw))=Areal_stage1((qw));
end


for p2=1:1:4
for i2=0:4:12
if (i2==0)
Ain_real2=Areal_stage1_r(i2+p2+v2);
Ain_im2= Aim_stage1_r(i2+p2+v2);
end
if (i2==4)
Bin_real2=Areal_stage1_r(i2+p2+v2);
Bin_im2= Aim_stage1_r(i2+p2+v2);
end
if (i2==8)
Cin_real2=Areal_stage1_r(i2+p2+v2);
Cin_im2= Aim_stage1_r(i2+p2+v2);
end
if (i2==12)
Din_real2=Areal_stage1_r(i2+p2+v2);
Din_im2= Aim_stage1_r(i2+p2+v2);
end
v2+p2+i2; % test milestone
end

factor1_stage2 = Ain_real2+Bin_im2-Cin_real2-Din_im2;
factor2_stage2 = Ain_real2-Bin_real2+Cin_real2-Din_real2;
factor3_stage2 = Ain_real2-Bin_im2-Cin_real2+Din_im2;
factor4_stage2 = Ain_im2-Bin_real2-Cin_im2+Din_real2;
factor5_stage2 = Ain_im2-Bin_im2+Cin_im2-Din_im2;
factor6_stage2 = Ain_im2+Bin_real2-Cin_im2-Din_real2;
Areal_stage2(v2+p2) =Ain_real2+Bin_real2+Cin_real2+Din_real2;
Aim_stage2(v2+p2) =Ain_im2+Bin_im2+Cin_im2+Din_im2;
Areal_stage2(v2+p2+4) = double(factor1_stage2*Wn_real(1+e2*4) + factor4_stage2*Wn_im(1+e2*4)); %test milestone
Aim_stage2(v2+p2+4) = double(factor4_stage2*Wn_real(1+e2*4) - factor1_stage2*Wn_im(1+e2*4));
Areal_stage2(v2+p2+8) = double(factor2_stage2*Wn_real(1+e2*8) + factor5_stage2*Wn_im(1+e2*8));
Aim_stage2(v2+p2+8) = double(factor5_stage2*Wn_real(1+e2*8) - factor2_stage2*Wn_im(1+e2*8));
Areal_stage2(v2+p2+12) = double(factor3_stage2*Wn_real(1+e2*12) + factor6_stage2*Wn_im(1+e2*12));
Aim_stage2(v2+p2+12) = double(factor6_stage2*Wn_real(1+e2*12) - factor3_stage2*Wn_im(1+e2*12));
end


% bit reversal of the Radix4
% example 003-> 300
ttt=0;
e2=0;
for v2=1:1:4
for p2=0:4:12
for i2=0:16:48
ttt=ttt+1;
k22(ttt)=v2+p2+i2;
end
end
end

for i=1:1:64
Aim_stage2_r(k22(i))=Aim_stage2(i);
Areal_stage2_r(k22(i))=Areal_stage2(i);
end

% Input signal


for p3=0:4:60
for i3=1:1:4
if (i3==1)
Ain_real3=Areal_stage2_r(i3+p3);
Ain_im3= Aim_stage2_r(i3+p3);
end
if (i3==2)
Bin_real3=Areal_stage2_r(i3+p3);
Bin_im3= Aim_stage2_r(i3+p3);
end
if (i3==3)
Cin_real3=Areal_stage2_r(i3+p3);
Cin_im3= Aim_stage2_r(i3+p3);
end
if (i3==4)
Din_real3=Areal_stage2_r(i3+p3);
Din_im3= Aim_stage2_r(i3+p3);
end
ool=p3+i3; % test milestone
end


factor1_stage3 = Ain_real3+Bin_im3-Cin_real3-Din_im3;
factor2_stage3 = Ain_real3-Bin_real3+Cin_real3-Din_real3;
factor3_stage3 = Ain_real3-Bin_im3-Cin_real3+Din_im3;
factor4_stage3 = Ain_im3-Bin_real3-Cin_im3+Din_real3;
factor5_stage3 = Ain_im3-Bin_im3+Cin_im3-Din_im3;
factor6_stage3 = Ain_im3+Bin_real3-Cin_im3-Din_real3;
Areal_stage3(p3+1) =Ain_real3+Bin_real3+Cin_real3+Din_real3;
Aim_stage3(p3+1) =Ain_im3+Bin_im3+Cin_im3+Din_im3;
Areal_stage3(p3+2) = double(factor1_stage3*Wn_real(1) + factor4_stage3*Wn_im(1));
Aim_stage3(p3+2) = double(factor4_stage3*Wn_real(1) - factor1_stage3*Wn_im(1));
Areal_stage3(p3+3) = double(factor2_stage3*Wn_real(1) + factor5_stage3*Wn_im(1));
Aim_stage3(p3+3) = double(factor5_stage3*Wn_real(1) - factor2_stage3*Wn_im(1));
Areal_stage3(p3+4) = double(factor3_stage3*Wn_real(1) + factor6_stage3*Wn_im(1));
Aim_stage3(p3+4) = double(factor6_stage3*Wn_real(1) - factor3_stage3*Wn_im(1));
end

% bit reversal of the Radix4
% example 003-> 300
ttt=0;
e2=0;
for v2=1:1:4
for p2=0:4:12
for i2=0:16:48
ttt=ttt+1;
k22(ttt)=v2+p2+i2;
end
end
end

for i=1:1:64
%if (Aim_stage1(i)<10^(-10))
% Aim_stage1_r(k22(i))=0;
%else
Aim_stage3_r(k22(i))=Aim_stage3(i);
Areal_stage3_r(k22(i))=Areal_stage3(i);
end

X1(1:N)=sqrt(Aim_stage3(1:N).^2+Areal_stage3(1:N).^2);
subplot(8,1,7)
plot(F1,X1,'-x')
grid
ylabel('Magnitude')
title('Radix 4 FFT N=64 my design (BEFORE REVERSED BIT)');

X1(1:N)=sqrt(Aim_stage3_r(1:N).^2+Areal_stage3_r(1:N).^2);
subplot(8,1,8)
plot(F1,X1,'-x')
grid
ylabel('Magnitude')
title('Radix 4 FFT N=64 my design (AFTER REVERSED BIT)');


subplot(8,1,1)
plot([0:63],Areal_stage1,'-o');
title('My Matlab FFT design - Radix 4, N=64, in shape geometry, reversing bit in each stage - The following plot is Stage1 real');
axis auto ;

subplot(8,1,2)
plot([0:63],Aim_stage1,'-o');
title('Stage1 im');
axis auto ;

subplot(8,1,3)
plot([0:63],Areal_stage2,'-o');
title('Stage2 real');
axis auto ;

subplot(8,1,4)
plot([0:63],Aim_stage2,'-o');
title('Stage2 im');
axis auto ;

subplot(8,1,5)
plot([0:63],Areal_stage3,'-o');
title('Stage3 real');
axis auto ;

subplot(8,1,6)
plot([0:63],Aim_stage3,'-o');
title('Stage3 im');
axis auto ;


figure(2)
X31(1:N)=sqrt(Aim_stage1_r(1:N).^2+Areal_stage1_r(1:N).^2);
subplot(3,1,1)
plot(F1,X31,'-x')
ylabel('Magnitude');
grid
title('The magnitude result of each stage of my FFT design in Matlab - This plot is for Magnitude of stage1');

X32(1:N)=sqrt(Aim_stage2(1:N).^2+Areal_stage2(1:N).^2);
subplot(3,1,2)
plot(F1,X32,'-x')
grid
ylabel('Magnitude')
title('Magnitude of stage2');

X33(1:N)=sqrt(Aim_stage3(1:N).^2+Areal_stage3(1:N).^2);
subplot(3,1,3)
plot(F1,X33,'-x')
grid
ylabel('Magnitude')
title('Magnitude of stage3');


%% saving the chosen equation into binary files named
% Input_from_Matlab_real.in & Input_from_Matlab_image.in

% decimal to binary
% Accepts decimal numbers between -1 < Number < 1
% and transforms them into binary (2 bit integer and 30 bit fractional)
% equationA is the input decimal number variable.
% pinakas is the output binary number
% handles 64 decimal numbers, for the 64 point data of the FFT (real and
% imaginary)

for number3=1:1:2
if (number3==1)
Realoo(1:64)=sample_signal_image(1:64);
else
Realoo(1:64)=sample_signal_real(1:64);
end

for ax=1:1:64
if (Realoo(ax)==1)
pinakas{ax,1}=0;
pinakas{ax,2}=1;
for tr=3:1:64
pinakas{ax,tr}=0;
end
end
if (Realoo(ax)==0)
for i=1:1:64
pinakas{ax,i}=0;
telos=1;
end
end
if (Realoo(ax)==-1)
pinakas{ax,1}=1;
pinakas{ax,2}=1;
for tr=3:1:64
pinakas{ax,tr}=0;
end
end
if (Realoo(ax)<1)&(Realoo(ax)>0)
pinakas{ax,1}=0;
pinakas{ax,2}=0;
zhtoymenos=Realoo(ax);
for tr=3:1:64
zhtoymenos=zhtoymenos*2;
if (zhtoymenos>=1)
zhtoymenos=zhtoymenos-1;
pinakas{ax,tr}=1;
else
pinakas{ax,tr}=0;
end
end
end
if (Realoo(ax)<0)&(Realoo(ax)>-1)
pinakas{ax,1}=1;
pinakas{ax,2}=1;
zhtoymenos=1+Realoo(ax);
for tr=3:1:64
zhtoymenos=zhtoymenos*2;
if (zhtoymenos>=1)
zhtoymenos=zhtoymenos-1;
pinakas{ax,tr}=1;
else
pinakas{ax,tr}=0;
end
end
end
end
Decimal_to_binary=[];

for i=1:1:64
for e=1:1:32
Decimal_to_binary=[Decimal_to_binary pinakas{i,e}]; %concatenate
end

if (number3==1) binary_result_to_Verilog_image = Decimal_to_binary ; % contains the decimal result of our Verilog Design (image)
else binary_result_to_Verilog_real = Decimal_to_binary ; % contains the decimal result of our Verilog Design (real)
end
end
end


% it stores the 32bit numbers in line without any ; or any other
% distinction
fid22 = fopen('Input_from_Matlab_image.in','w');
fprintf(fid22,'%u',binary_result_to_Verilog_image);
fclose(fid22)
fid122 = fopen('Input_from_Matlab_real.in','w');
fprintf(fid122,'%u',binary_result_to_Verilog_real);
fclose(fid122)

%% Converting binary results from Verilog Modelsim files into decimal
[decimal_result_of_Verilog_real,decimal_result_of_Verilog_image]=binary_to_decimal('TB_data_file_Output_image.out','TB_data_file_Output_real.out',2^9);
[decimal_result_of_Verilog_real_stage1,decimal_result_of_Verilog_image_stage1]=binary_to_decimal('TB_data_file_stage1_image.out','TB_data_file_stage1_real.out',2^3);
[decimal_result_of_Verilog_real_stage2,decimal_result_of_Verilog_image_stage2]=binary_to_decimal('TB_data_file_stage2_image.out','TB_data_file_stage2_real.out',2^6);

%% figure that contains the fft and my Verilog fft
figure(5);
as(1:64)=0.0001;
Xin_real=(sample_signal_real);
subplot(4,1,1)
plot(F1,Xin_real,'-o')
title('Matlabs Input Function Real');
axis auto ;

Xin_image=(sample_signal_image);
subplot(4,1,2)
plot(F1,Xin_image,'-o')
title('Matlabs Input Function Image');
axis auto ;

X2=abs(fft(sample_signal_real+ j*sample_signal_image,N));
subplot(4,1,3)
plot(F1,X2,'-x')
title('FFT N=64 - Matlabs Ready Function');
axis auto ;

Result_Verilog(1:N)=double(sqrt(decimal_result_of_Verilog_real(1:N).^2+decimal_result_of_Verilog_image(1:N).^2));

subplot(4,1,4)
plot(F1,Result_Verilog,'-x')
grid
title('Radix 4 FFT N=64 VERILOG output before the proper reverse of the final stage - As it comes from the file');
axis auto;


% bit reversal of the Radix4
% example 003-> 300
ttt=0;
e2=0;
for v2=1:1:4
for p2=0:4:12
for i2=0:16:48
ttt=ttt+1;
k22(ttt)=v2+p2+i2;
end
end
end

for i=1:1:64
%if (Aim_stage1(i)<10^(-10))
% Aim_stage1_r(k22(i))=0;
%else
koykoyi(k22(i))=decimal_result_of_Verilog_image(i);
koykoyr(k22(i))=decimal_result_of_Verilog_real(i);
end

eel(1:64,3)=koykoyr+j*koykoyi; % changed
FFT_Verilog_Final_Magn(1:N)= sqrt(koykoyr(1:N).^2+koykoyi(1:N).^2);
FFT_Verilog_Final_Angle(1:N)=unwrap(angle(koykoyr+j*koykoyi));

figure(12)
plot(F1,FFT_Verilog_Final_Magn,'-x');
title('My Verilog FFT Stage Final Output');
ylabel('Magnitude');
axis auto ;

%% Helpful reversing of Verilog files, stage 2
ttt1=0;
e21=0;

for p21=0:16:48
for v21=1:1:4
for i21=0:4:12
ttt1=ttt1+1;
k221(ttt1)=v21+p21+i21;
end
end
end

for i1=1:1:64
%if (Aim_stage1(i)<10^(-10))
% Aim_stage1_r(k22(i))=0;
%else
Aim_stage2_rr(k221(i1))=Aim_stage2(i1);
Areal_stage2_rr(k221(i1))=Areal_stage2(i1);
end

%% The magnitude result of each stage of my FFT design in Matlab
%% compared to the magnitude result of each stage from Verilog (after being
% radiated) (not reversed)

for qw=1:1:64
decimal_result_of_Verilog_image_stage1_r(k22(qw))=decimal_result_of_Verilog_image_stage1((qw));
decimal_result_of_Verilog_real_stage1_r(k22(qw))=decimal_result_of_Verilog_real_stage1((qw));
end

%for qw=1:1:64
%decimal_result_of_Verilog_image_stage2_r((qw))=Aim_stage2(k22(qw));
%decimal_result_of_Verilog_real_stage2_r((qw))=Areal_stage2(k22(qw));
%end

%for qw=1:1:64
%decimal_result_of_Verilog_image_r((qw))=Aim_stage3(k22(qw));
%decimal_result_of_Verilog_real_r((qw))=Areal_stage3(k22(qw));
%end

figure(12)
X31(1:N)=sqrt(Aim_stage1_rr(1:N).^2+Areal_stage1_rr(1:N).^2);
AA21(1:N)=sqrt(decimal_result_of_Verilog_real_stage1_r(1:N).^2+decimal_result_of_Verilog_image_stage1_r(1:N).^2);
subplot(3,1,1)
plot(F1,X31,'-x',F1,AA21,'ro')
ylabel('Magnitude');
grid
title('The magnitude result of each stage of my FFT design in Matlab compared to the magnitude result of each stage from Verilog (after being radiated) (not reversed) - This plot is for Magnitude of stage1');
legend('FFT expected value','FFT Radiated Value');

X32(1:N)=sqrt(Aim_stage2_rr(1:N).^2+Areal_stage2_rr(1:N).^2);
AA22(1:N)=sqrt(decimal_result_of_Verilog_real_stage2(1:N).^2+decimal_result_of_Verilog_image_stage2(1:N).^2);
subplot(3,1,2)
plot(F1,X32,'-x',F1,AA22,'ro')
grid
ylabel('Magnitude')
title('Magnitude of stage2');

X33(1:N)=sqrt(Aim_stage3(1:N).^2+Areal_stage3(1:N).^2);
AA23(1:N)=sqrt(decimal_result_of_Verilog_image(1:N).^2+decimal_result_of_Verilog_real(1:N).^2);
subplot(3,1,3)
plot(F1,X33,'-x',F1,AA23,'ro')
grid
ylabel('Magnitude')
title('Magnitude of stage3');


figure(14)
%%colordef black
difference_stage1_values=X31(1:N)-AA21(1:N);
subplot(3,1,1)
plot(F1,difference_stage1_values,'-x')
hold on

DlLOU (Ply 2e(=-24.,.5)¢ ee" py Ply R224.) pte) 

hold off
ylabel('Difference in stage1 values of figure 12');
grid
legend('Difference','Non Radiated Expected Limits');
difference_stage2_values=X32(1:N)-AA22(1:N);
subplot(3,1,2)
plot(F1,difference_stage2_values,'-x')
hold on

Dloe (Pilg 2° (21.5) y pd pa a 21) ye RE) 

hold off
grid
ylabel('Difference in stage2 values of figure 12');
difference_stage3_values=X33(1:N)-AA23(1:N);
subplot(3,1,3)
plot(F1,difference_stage3_values,'-x')
hold on

Dloe (Pilg 2° (10.5) » Ep pa aH 1) ye ee) 

hold off
grid
ylabel('Difference in stage3 values of figure 12');

figure(15)
%%colordef black
difference_stage1_values=X31(1:N)-AA21(1:N);
subplot(3,1,1)
plot(F1,difference_stage1_values,'-x')
hold on

Plow (Ply 2° =6.0)., 7 ee ply 2 (RH 6) ED 

hold off
ylabel('Difference in stage1 values of figure 12');
grid
legend('Difference','Radiated Expected Limits');
difference_stage2_values=X32(1:N)-AA22(1:N);
subplot(3,1,2)
plot(F1,difference_stage2_values,'-x')
hold on

DlLor. (Piy2° 43.0) 4 °bE*, Pilg 2" (3.5); te) 

hold off
grid
ylabel('Difference in stage2 values of figure 12');
difference_stage3_values=X33(1:N)-AA23(1:N);
subplot(3,1,3)
plot(F1,difference_stage3_values,'-x')
hold on

plow (Pi, 2°(- 25): "Fn? -PlyH2 (2.35), Fe") 

hold off
grid
ylabel('Difference in stage3 values of figure 12');




%% Checking the Phase
my_Matlab_fft(1:N)=sqrt(Aim_stage3_r(1:N).^2+Areal_stage3_r(1:N).^2); % changed
Angle_of_my_Matlab_fft=unwrap(angle(Areal_stage3_r+ j*Aim_stage3_r)); % changed
fft_build_in_matlab=abs(fft(sample_signal_real+j*sample_signal_image,N));
Angle_of_build_in_fft=unwrap(angle(fft(sample_signal_real+j*sample_signal_image,N)));
eel(1:64,1)=fft(sample_signal_real(1:64)+j*sample_signal_image(1:64));
eel(1:64,2)=Areal_stage3_r+ j*Aim_stage3_r; % changed
pinakasA1=real(eel(:,1));
pinakasA2=imag(eel(:,1));
pinakasB1=real(eel(:,3))
pinakasB2=imag(eel(:,3));
Check_phase1=abs(pinakasA1-pinakasB1);
Check_phase2=abs(pinakasA2-pinakasB2);


figure(9)
plot(real(eel(:,1)),[1:64],'xr',real(eel(:,3)),[1:64],'ob');
title('Real results Difference');
figure(10)
plot(imag(eel(:,1)),[1:64],'xr',imag(eel(:,3)),[1:64],'ob');
title('Image results Difference');
figure(3)

subplot(6,1,1)
plot(F1,FFT_Verilog_Final_Magn,'-x');
title('Radix 4 FFT N=64 - My Verilog FFT');
ylabel('Magnitude');
axis auto ;
subplot(6,1,2)
plot(F1,my_Matlab_fft,'-x')
grid
title('Radix 4 FFT N=64 my Matlab design');
ylabel('Magnitude')
subplot(6,1,4)
plot(F1,FFT_Verilog_Final_Angle,'-x');
title('Angle of the my Verilog fft ');
ylabel('Phase')
subplot(6,1,5)
plot(F1,Angle_of_my_Matlab_fft,'-x');
title('Angle of my Matlab fft');
ylabel('Phase')
subplot(6,1,3)
plot(F1,fft_build_in_matlab,'-x');
title('FFT N=64 - Matlabs Build In Function');
ylabel('Magnitude');
axis auto ;
subplot(6,1,6)
plot(F1,Angle_of_build_in_fft,'-x');
title('Angle of the build in fft ');
ylabel('Phase')
% comparing My Matlab Function FFT with FFT Verilog (comparing
%% the magnitude and the phase)
figure(4)
for i=1:1:64
if (my_Matlab_fft<10^-12)
end
end
Amagn=abs(my_Matlab_fft-FFT_Verilog_Final_Magn);
Aphase=abs(Angle_of_my_Matlab_fft-FFT_Verilog_Final_Angle);
if (single(Check_phase1)<=(2^-1))&(single(Check_phase2)<=(2^-1)) % changed
Phase_and_Magnitude_Checking='Correct !!!'
else
Phase_and_Magnitude_Checking='Failed !!'
end
plot(F1,Amagn,'+g',F1,Aphase,'or');
title('Difference in Magnitude or in Phase between build in FFT and my Matlab FFT ');
axis auto ;
%axis([0 1 0 (10^(-8))]); % changed
xlabel('N=64')
ylabel('Absolute value')
legend('Magnitude Difference','Phase Difference');
figure(13)
subplot(3,2,1),plot(F1,Xin_real,'-o')
title('Matlabs Input Function Real');
axis auto ;
subplot(3,2,2),plot(F1,Xin_image,'-o')
title('Matlabs Input Function Image');
axis auto ;
subplot(3,2,[5 6])
difference_in_output_a=fft_build_in_matlab-FFT_Verilog_Final_Magn;
plot(F1,difference_in_output_a,'-xr');
ylabel('Detailed Difference in Magnitude of the Output')

subplot(3,2,[3 4])
plot(F1,fft_build_in_matlab,'-gs',F1,FFT_Verilog_Final_Magn,'--+');
ylabel('Magnitude of the Output')
xlabel('N=64 samples')
legend('fft build in matlab','FFT Verilog Final Magn (Post-Route Simulation)');
hold on
for i=1:1:64
if (abs(difference_in_output_a(i))>3.8*10^(-6))
plot((i-1)/64,FFT_Verilog_Final_Magn(i),'ro')
legend('fft build in matlab','FFT Verilog Final Magn (Post-Route Simulation)','Identified Errors');
end
end
hold off


Real_diff=Check_phasel (1); 
Image_diff=Check_phase2 (1); 
for 1=1:1:64 
if (Real_diff<Check_phasel (1) ) 
Real_diff=Check_phasel (1); 
end 
if (Image_diff<Check_phase2 (i) ) 
Image_diff=Check_phase2 (1); 
end 


end 

if (Real_diff>Image_diff) 
Major_diff=Real_diff; 

else 
Major_diff=Image_diff; 

end 
if (single(Major_diff)<=(2^(-1)))
    title('FFT Comparing Check =Correct! Your Verilog FFT works at 16/7 MHz');
else
    title('FFT Comparing Check =Wrong! Your Verilog FFT does not work.');
end 


Major_diff
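
The loop above keeps a running maximum over the two error vectors and then picks the larger of the two results. A compact equivalent is sketched below using the built-in `max`. The three-element vectors are made-up stand-ins for `Check_phase1` and `Check_phase2`, which in the script are computed earlier; `check_ok` is a hypothetical name introduced only for this illustration.

```matlab
% Illustrative stand-ins for the two 64-element error vectors (assumed values).
Check_phase1 = [0.10 0.30 0.20];
Check_phase2 = [0.05 0.25 0.40];

% Largest entry across both vectors, which is what the explicit loop computes.
Major_diff = max([max(Check_phase1), max(Check_phase2)]);

% Same threshold as the script (2^-1).
check_ok = single(Major_diff) <= 2^(-1);
```

With these sample values `Major_diff` is 0.4, so the check passes.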


% binary to decimal
% Accepts binary numbers between 01.000..00 - 11.111...11
% (2-bit integer part and 30-bit fractional part)
% ap is the input binary number variable (be careful, there is no
% decimal point in the input binary numbers!)
% result1 is the output decimal number
%
% The following part loads the 32-bit numbers from Verilog that are in
% line without any ; or any other distinction (only a space between the
% numbers).
% "transformed_bin_to_dec" is a 32-bit, 64-cell matrix that contains the
% binary output of the Verilog simulation file

function [decimal_result_of_Verilog_real, decimal_result_of_Verilog_image]=binary_to_decimal(TB_data_file_Output_image, TB_data_file_Output_real, multipl)


btd = fopen(TB_data_file_Output_image,'r');
Binary_to_decimal_image=fscanf(btd,'%c');
%binary_to_decimal=fread(btd,'uint');%[32,32],'uint');
%a33=Binary_to_decimal.';
fclose(btd)
for i=1:1:64
    %for eqq=0:1:31
    [token, Binary_to_decimal_image] = strtok(Binary_to_decimal_image);
    %transformed_bin_to_dec{i,1:32}=sscanf(token,'%c');
    transformed_bin_to_dec_image{i}=token;
    %binary_to_decimal{i,1:32};
    % transformed_bin_to_dec(i)=Binary_to_decimal(1+31*eqq:32+31*eqq);
end


btd1 = fopen(TB_data_file_Output_real,'r');
Binary_to_decimal_real=fscanf(btd1,'%c');
fclose(btd1)

for i=1:1:64
    %for eqq=0:1:31
    [token, Binary_to_decimal_real] = strtok(Binary_to_decimal_real);
    %transformed_bin_to_dec{i,1:32}=sscanf(token,'%c');
    transformed_bin_to_dec_real{i}=token;
    %binary_to_decimal{i,1:32};
end


for number2=1:1:2
for oeo=1:1:64 


if (number2==1) 


% transformed_bin_to_dec(i)=Binary_to_decimal(1+31*eqq:32+31*eqq);


ap=transformed_bin_to_dec_image{oeo}; 


else 


ap=transformed_bin_to_dec_real{oeo}; 


end 


%ap='00111000000000000000000000000000';
%ap='OOUQ00LUOULOOUO0LIOOLTOLIOOLOLOOLO';
%
%bin2dec(ap(3:32))/2^30


zhtoymeno1=1;

if (ap(1:2)=='01')
    result1=1;
end
if (ap(1:2)=='00')
    result1=0;
    for e1=3:1:32
        if (ap(e1))=='1'
            num1=1;
        else
            num1=0;
        end
        zhtoymeno1=(zhtoymeno1)/2;
        result1=num1*zhtoymeno1+result1;
    end
end
if (ap(1:2)=='11')
    result1=-1;
    for e1=3:1:32
        if (ap(e1))=='1'
            num1=1;
        else
            num1=0;
        end
        zhtoymeno1=(zhtoymeno1)/2;
        result1=result1+num1*zhtoymeno1;
    end
end
% in the following calculation I multiply the final result by 2^9 due to
% 3bit step down on each stage of the FFT =>(2^3)*(2^3)*(2^3)=2^9
if (number2==1)
    decimal_result_of_Verilog_image(oeo)=result1*multipl; % contains the decimal result of our Verilog Design (image)
else
    decimal_result_of_Verilog_real(oeo)=result1*multipl; % contains the decimal result of our Verilog Design (real)


end 
end 


end 
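
The conversion above walks the 30 fractional bits one at a time. As a cross-check, the same interpretation (2 integer bits, 30 fractional bits, two's complement) can be sketched in a few lines with the built-in `bin2dec`. The helper name `q2_30_to_decimal` is hypothetical and not part of the thesis code, and it is meant to be saved in its own file. Unlike the loop above, it also accepts the `10` prefix and fractional bits after `01`, which is an assumption about the intended number format.

```matlab
function value = q2_30_to_decimal(ap)
% Q2_30_TO_DECIMAL  Hypothetical helper: convert a 32-character binary
% string (2 integer bits, 30 fractional bits, two's complement) to a double.
raw = bin2dec(ap);            % unsigned value in the range 0 .. 2^32-1
if ap(1) == '1'               % sign bit set: wrap into the negative range
    raw = raw - 2^32;
end
value = raw / 2^30;           % move the binary point 30 places to the left
end
```

For the commented-out test value in the listing, `q2_30_to_decimal('00111000000000000000000000000000')` returns 0.875, since the three set fractional bits contribute 1/2 + 1/4 + 1/8.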